(54) System for backing up files from disk volumes on multiple nodes of a computer network

(57) A system for backing up files from disk volumes
on multiple nodes of a computer network to a common
random-access backup storage means. As part of the
backup process, duplicate files (or portions of files) may
be identified across nodes, so that only a single copy of
the contents of the duplicate files (or portions thereof) is
stored in the backup storage means. For each backup
operation after the initial backup on a particular volume,
only those files which have changed since the previous
backup are actually read from the volume and stored on
the backup storage means. In addition, differences
between a file and its version in the previous backup
may be computed so that only the changes to the file
need to be written on the backup storage means. All of
these enhancements significantly reduce both the
amount of storage and the amount of network
bandwidth required for performing the backup. Even when
the backup data is stored on a shared-file server, data
privacy can be maintained by encrypting each file using
a key generated from a fingerprint of the file contents,
so that only users who have a copy of the file are able to
produce the encryption key and access the file
contents. To view or restore files from a backup, a user may
mount the backup set as a disk volume with a directory
structure identical to that of the entire original disk
volume at the time of the backup.

Description

The present invention relates to a system for allowing
multiple nodes on a computer network to back up
files to a common random-access backup storage
means.

Background of the Invention

Backing up data and program files (often together
referred to as "data" here) from computer disks has
been a well-known practice for many years. There are
two major reasons why data needs to be backed up.
The first reason is that the disk hardware may fail,
resulting in an inability to access any of the valuable
data stored on the disk. This disastrous type of event is
often referred to as a catastrophic failure; in this case,
assuming that backups have been performed, the
computer operator typically "restores" all his files from the
most recent backup. Fortunately, new computer disks
and controllers have become more reliable over the
years, but the possibility of such a disaster still cannot
be ignored. The second reason for backup is that users
may inadvertently delete or overwrite important data
files. This type of problem is usually much more
common than a catastrophic hardware failure, and the
computer operator typically restores only the destroyed files
from the backup medium (e.g., tapes) to the original
disk.

In general, the backup device is a tape drive,
although floppy disk drives and other removable disk
drive technologies (e.g., Bernoulli, Syquest, optical) are
also used. Tape has the advantage of having a lower
cost per byte of storage (when considering the cost of
the media only, ignoring the cost of the drive), and for
that reason tape is preferred in most applications,
particularly those where large amounts of data are
involved, such as network file servers. Tape is primarily
a sequential access medium; random accesses, while
possible, usually require times on the order of tens of
seconds (if not minutes), as opposed to milliseconds for
a disk drive. Similarly, the time to stop and restart a
moving tape is on the order of seconds, so it is
important to supply enough data to keep the tape drive
"streaming" in order to ensure acceptable backup
performance. After a backup is completed, the tape
cartridge may be taken off-site for safekeeping. When the
need arises to restore data from a given backup, the
appropriate tape cartridge is re-inserted into the tape
drive, and the user selects the file(s) to be restored,
which are in turn retrieved from the tape and written to a
disk volume.

The tasks of physically storing the set of tape
cartridges in a safe environment and cataloging them to
facilitate selection of the tape(s) required for restore are
important (and often challenging) functions of both the
backup software and the backup administrator (i.e., the
individual(s) responsible for implementing the backup
process and policy). In addition, if the backup or restore
operations involve multiple tapes, the ability to switch
between tapes must be provided, either manually by a
backup administrator or automatically by a jukebox
(i.e., a robotic tape autochanger). Switching
between tapes thus can involve a considerable direct
cost, either for salary or for jukebox robotics, as well as
a substantial time delay, normally tens of seconds or
more.

In order to save backup time as well as the amount
of tape used, various types of "incremental" backup
strategies may be employed. For example, a common
practice involves performing a full backup of all files on
a disk volume once per week, and then backing up only
the files that have changed since the last backup on
subsequent days of the week. Another variation on this
idea is known as "differential" backup, in which each
partial backup contains all changes since the last full
backup instead of from the previous partial backup; this
method guarantees that only two backups (one full and
one partial) need to be accessed to restore files as of
the time of a particular backup. Since in most cases the
amount of data that actually changes on a disk volume
per day is a small fraction of the total, such approaches
have the advantage of significantly reducing the backup
"window", or amount of time required for a backup, on
the days when an incremental backup is performed. Also, it is
often possible to fit the data from a full backup and
several incremental backups all on a single tape cartridge,
obviating the need for any tape switching in the days
intervening between full backups. In the case where the
disk volume and the tape drive are on separate
computers connected over a network, incremental backup also
considerably decreases the network bandwidth
requirements.
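
By way of illustration only, the number of backup sets that must be
read for a restore under each strategy can be modeled in a few lines
of Python. The `BackupSet` record and the `sets_needed` helper are
hypothetical names introduced for this sketch; they do not correspond
to any product discussed here, and the model assumes a schedule that
uses one kind of partial backup between full backups.

```python
from dataclasses import dataclass


@dataclass
class BackupSet:
    day: int    # day index within the backup cycle
    kind: str   # "full", "incremental", or "differential"


def sets_needed(history: list[BackupSet], target_day: int) -> list[BackupSet]:
    """Return the backup sets to read in order to restore as of target_day.

    Raises ValueError if no full backup exists at or before target_day.
    """
    upto = [b for b in history if b.day <= target_day]
    # Locate the most recent full backup at or before the target day.
    last_full = max(i for i, b in enumerate(upto) if b.kind == "full")
    full, partials = upto[last_full], upto[last_full + 1:]
    if partials and partials[-1].kind == "differential":
        # A differential holds every change since the last full backup,
        # so only the latest one is needed.
        return [full, partials[-1]]
    # Each incremental holds only the changes since the previous backup,
    # so every intervening set must be read.
    return [full] + partials
```

With a weekly full backup followed by four daily partial backups, this
model reads five sets for an incremental schedule but only two for a
differential schedule, which matches the "one full and one partial"
guarantee described above.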

While it is true that incremental backups can save
time and media, they are also often much harder to use
than full backups. From a user's perspective, the set of
files included on each incremental backup is normally
quite unrelated to how he views the contents of his disk
volume. In other words, although certain files may have
changed since the last backup, the disk volume still
contains a complete copy of all files, changed and
unchanged, any of which may be required to perform a
given operation. Unfortunately, the restore software
dealing with incremental backups in the prior art
typically presents to the user a view of only the changed
files, not a merged view of all files present on the disk at
the time of the incremental backup. Thus, for example, if
a user wishes to restore a given set of files, say an
entire subdirectory, as of the date of a given incremental
backup, he often will have to restore the files from the
previous full backup and then each of the intervening
incrementals in order to guarantee the correct "latest"
copy of each file. Similarly, if the user wishes to identify
a set of files from the backup tapes, he normally must
peruse several incremental/full backup sets in order to
find all the files of interest. Once the files have been
selected, they may very well be spread all over the tape,
even if they were all contiguous on the disk and the full
backup, thus resulting in a very slow restore process.
For these reasons, incremental backup is often used
grudgingly at best, and not infrequently it is rejected in
favor of always performing full backups.

Another significant limitation in performing restores
has to do with how the user may access files stored on
the tapes. Restoring files typically involves running a
special application, provided as part of the backup
software package, that allows the user to select his files and
then restore them from tape to a disk volume. Because
the user runs the restore application infrequently, it
presents an unfamiliar interface to dealing with files and
does not allow accessing the files directly with other
application programs. Most users already have their
own favorite set of applications for viewing and dealing
with files, including word processors, file managers,
spreadsheets, etc., so the concept of "mounting" the
backup image as a pseudo-disk volume to allow the
user to view, select, and restore files using his own tools
seems attractive and has been implemented in a few
cases (e.g., Columbia Data's Snapback product; U.S.
Patent Application entitled "SYSTEM FOR BACKING
UP COMPUTER DISK VOLUMES," filed October 4,
1995, and assigned to the assignee of the present
invention, incorporated herein by reference). However,
the inherent slowness of tape in such random access
applications makes the usefulness of this concept
somewhat limited, and this awkwardness is particularly
exacerbated if incremental backups have effectively
spread the files out even further on the tape than they
would be in a full backup.

For a single standalone computer, the configuration
for backup consists of adding a backup device (e.g.,
tape drive) to the computer. In a networked
environment, however, the situation is much more complex, and
many configurations have been employed in an attempt
to address the intricate tradeoffs in cost, manageability,
and bandwidth (both tape and network). Most computer
networks include nodes that are file or application
servers as well as nodes that are user workstations (e.g.,
desktop personal computers). Servers generally
contain critical data for an entire company or department,
so backing up the server(s) is considered an imperative
task and is normally handled by a network
administrator. It is not uncommon for each server to have a
dedicated tape drive for backing up its disk(s), but in many
instances a single tape drive may be used to back up
multiple servers by sending the backup data over the
network. The former approach is more expensive and
involves managing tape cartridges at more locations,
but it avoids network bandwidth limitations in the latter
approach that often make it impossible to keep a
high-speed tape drive streaming with data coming over the
network. Given the complexities of the various factors
involved, including drive cost, media cost, tape drive
speed, network bandwidth, frequency of backup, size of
hard disks, acceptable range of backup window, etc., it
is not surprising that many systems utilize a mixture of
approaches that evolves over time as technology
progresses.

Personal computer workstations on networks also
often contain critical data. The amount of such data is
increasing as the average workstation disk capacity
grows, due to the continual decreases in the
cost of disk drive capacity. Nonetheless, the data from
workstations on most networks today is backed up only
sporadically, if at all, despite the availability of several
seemingly viable solutions. For example, installing a
standalone tape drive on each workstation could solve
the problem in theory, but in practice there are many
serious obstacles that limit the effectiveness of this
approach. Among these problems are drive cost, media
cost, end-user training cost, and the difficulty of
managing the backup tapes, which are necessarily dispersed
physically throughout the organization, not to mention
the lack of user discipline in regularly performing
acceptable backups.

Another method, included as part of almost every
network backup software package, allows the
workstation data to be backed up over the network to a shared
backup device, thus permitting a centralized
administration of drives and media and amortizing the hardware
cost among many users. However, despite its ready
availability, this technique is employed in only a small
percentage of installations, for a variety of reasons. For
example, it is not uncommon for networks to contain
dozens or even hundreds of workstations, in which case
there may not be enough network bandwidth to back up
all of them in a reasonable window (e.g., overnight).
Further, the sheer volume of data involved typically
forces the use of a tape jukebox, which greatly
increases the cost of hardware and of tape
management. Also, there can be conflicts in scheduling the use
of the tape drive for each workstation, particularly since
there is a need on one hand to provide data at a high
enough rate to keep the tape drive streaming, but on
the other hand the act of backing up typically slows
down user response and network response. One
system, patented by Gigatrend in U.S. Patent No.
5,212,772, worked around part of this problem by
interleaving the data from multiple workstations onto a single
tape cartridge in an attempt to guarantee that the tape
could continue streaming, but this approach met with
little commercial success. Perhaps the major obstacle to
acceptance is the simple fact that, once a user decides
that he needs to restore data from a previous backup of
his workstation, he does not have control of the media
to select the tape(s) of interest, unless a very large (and
expensive) tape jukebox is used and appropriate
software is available to manipulate the jukebox remotely.
Manually assisting the user in this task rarely rates very
high on the priority list of a network administrator, and it
may also conflict with other scheduled uses of the tape
drive(s); the result is almost invariably that the delay in
finally accessing the data to be restored leaves both the
end-user and the network administrator frustrated.

With the ever-decreasing cost of disk drive capacity,
another solution to the workstation backup problem has
been recently employed in some networks. The network
administrator adds a large disk drive (or set of drives) to
a file server on the network, and users simply copy files
from their workstations to a subdirectory tree on the
new disk. If desired, privacy of the backup data can be
ensured by assigning standard network security access
rights to each user's directory. The files placed on the
server are backed up as part of the regular server
backup process, providing a second level of data
recovery if necessary. Each user can easily access the files
from his directory on the network using his own
preferred applications, without any intervention on the
part of the network administrator. This
approach could also be applied to server backup if
desired. At current prices, it is possible to add one
gigabyte of disk space for each user for a price comparable
to that of a low-end tape drive; while this solution may be
more expensive on a cost per megabyte basis than
others discussed here, the cost is nonetheless acceptable
in certain environments. This method is not without its
problems, such as network bandwidth constraints, need
for user discipline in regularly backing up all important
files, and inability to retrieve older versions of files
without accessing a tape, which typically requires
administrator assistance. However, it does overcome some key
obstacles which are not easily addressable using a
tape-only solution.

It is readily observed that most workstations on a 
network contain many files with identical contents, par- 
ticularly operating system files, program files, and other 
files that are distributed as part of software packages, 
stored on the user's disk, and never modified. It
also seems to be true that the percentage of disk contents
occupied by such common files is increasing with time,
particularly as disk drive capacity grows and more
software is distributed on CD-ROMs. However, observe that
none of the prior art backup approaches discussed
above takes advantage of these phenomena in any way.

Summary of the Invention

It is the goal of the present invention to overcome
many of the problems historically associated with
backing up data from multiple nodes on a computer network.
In contrast to the prior art, the present invention
provides a lower-cost backup solution that simultaneously
reduces network bandwidth consumption, decreases
the time required for backup and restore, allows for
central administration, automates the backup process at
user workstations, provides access to all versions of
previous files without any administrator intervention,
and permits the user to access files from the backup
directly using his own applications.

In the present invention, files are backed up from
disk volumes on multiple nodes of a computer network
to a common random-access backup storage means,
typically a disk volume. Backups can be scheduled,
either by the user or by the backup administrator, to
occur automatically and independently for each node.
As part of the backup process, duplicate files (or
portions of files) may be identified across nodes, so that
only a single copy of the contents of duplicate files
(or portions thereof) is stored in the backup storage
means. The preferred embodiment includes a search
method for identifying duplicate files that is extremely
efficient [...] millions of files have been added to the
backup system. For each backup operation after the
initial backup on a particular volume, only those files
which have changed since the previous backup need to
be read from the volume and stored on the backup
storage means; pointers to the contents of [...]
the directory information for the backup. In addition,
differences between a file and its version in the previous
backup may be computed so that only the changes to
the file need to be written on the backup storage means,
and almost all data written to the backup storage means
is compressed using a lossless compression algorithm.
Each of these "data reduction" enhancements
significantly decreases both the amount of storage and the
amount of network bandwidth required for performing
the backup. In fact, the data reduction is effective
enough in most instances to lower the amount of
storage required to the point where the system cost of using
disk drives as the backup storage means is less
than the cost of conventional tape backup systems
in such environments, particularly given the rapidly
declining cost of disk storage.
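
The idea of storing a single copy of duplicate file contents across
nodes can be sketched as a small content-addressed store. This is a
minimal illustration under stated assumptions, not the search method of
the preferred embodiment: the `DedupStore` class and its method names
are hypothetical, and SHA-256 is assumed as the fingerprint function
purely for convenience.

```python
import hashlib


class DedupStore:
    """Toy backup store in which identical file contents are kept once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}                  # fingerprint -> contents
        self.directories: dict[str, dict[str, str]] = {}   # node -> {path: fingerprint}

    def backup_file(self, node: str, path: str, contents: bytes) -> bool:
        """Record a file for a node; return True if its contents were new."""
        fingerprint = hashlib.sha256(contents).hexdigest()  # assumed fingerprint
        is_new = fingerprint not in self.blobs
        if is_new:
            # Only the first node to post these contents pays for storage.
            self.blobs[fingerprint] = contents
        # Every node keeps its own pointer to the shared contents.
        self.directories.setdefault(node, {})[path] = fingerprint
        return is_new

    def restore_file(self, node: str, path: str) -> bytes:
        """Follow the node's pointer back to the shared contents."""
        return self.blobs[self.directories[node][path]]
```

In this model a second workstation backing up the same operating system
file adds only a directory pointer, so neither the storage used nor the
data sent over the network grows with the number of duplicate copies.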

Even when the backup data is stored on an openly 
accessible shared-file server, data privacy can be main- 
tained by encrypting the contents of each file using an 
encryption key generated from a hash function of the file 
contents, so that only users who once backed up a copy 
of the file are able to produce the encryption key and 
access the file contents. 
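
The principle of deriving the encryption key from the file contents can
be illustrated as follows. The sketch assumes SHA-256 as the hash
function and uses a toy XOR keystream in place of a real cipher; both
choices, and the names `content_key` and `toy_cipher`, are assumptions
made for illustration and are not taken from the specification. The toy
cipher is not secure and only shows the data flow.

```python
import hashlib


def content_key(contents: bytes) -> bytes:
    """Derive a key from the file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(contents).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Expand the key into `length` bytes by hashing key + block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts.

    Illustration only: a real system would use a vetted cipher.
    """
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
```

Two users who each hold the same file derive the same key independently,
so either can decrypt the single stored copy, while a user who never
held the file has no way to compute the key from the encrypted data.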

To view or restore files from a backup, a user may
mount the backup set as a real-time (i.e., with disk 
access times, not tape access times) temporary disk 
volume with a directory structure identical to that of the 
entire original disk volume at the time of the backup. 
The user may then access the files directly using his 
own applications, without first having to copy them using 
a separate restore program. The backup disk volume 
may be mounted in a read-only mode; alternatively, 
write access can be provided to allow transient modifi- 
cations to the files, although all such modifications are 
normally lost once the backup volume is unmounted. 

Brief Description of the Drawings 

A preferred embodiment of the present invention is 
illustrated in and by the following drawings, in which like 
reference numerals indicate like parts and in which: 

FIGURE 1 is a block diagram illustrating a typical
network configuration for the backup system of the
present invention;
FIGURE 2 is a diagram illustrating a typical
directory structure in which backup files are stored;
FIGURE 3 is a block diagram illustrating the types
of files contained in the backup directories of the
present invention;
FIGURE 4 is a Backus-Naur Form (BNF)
description of the format of the directory information in a
backup directory file in accordance with the present
invention;
FIGURE 5 is an assembly language [...]
specific example of the directory entry format
defined in FIGURE 4;
FIGURE 6 is a block diagram of the layout of a
backup data file in accordance with the present
invention;
FIGURE 7 is a BNF description of the format of a
backup data file in accordance with the present
invention;
FIGURE 8 is a block diagram illustrating an
example of a <seekPts> record in a backup data file, in
accordance with the present invention;
FIGURE 9 is a diagram illustrating the layout of a
global directory database file in accordance with
the present invention;
FIGURE 10 is a diagram illustrating the data
structures used in searching the first level of the global
directory database in accordance with the present
invention; and
FIGURE 11 is a diagram illustrating how user
passwords are used to access encryption keys to
access backup data in accordance with the present
invention.

Detailed Description of the Preferred Embodiment

The preferred embodiment of the present invention
uses disk space on a network file server (or servers) as
its backup storage means. Each client workstation is
responsible for copying the backup data to a
preassigned location or directory on the file server, as well as
for searching the backup "database" to identify
duplicate files across users and to compute the differences
(or deltas) between file versions. Thus, the preferred
embodiment is not truly a client/server system, although
certain housekeeping functions essential to
performance and security need to be performed by an Agent
task, which may run on any network node, including the
file server itself.

In an alternate embodiment, the backup storage
means consists of disk space on an application server
(the backup server). The network nodes communicate
with the backup server in a traditional client/server
paradigm. The Agent functions are performed by the
backup server. This embodiment provides slightly
higher security than the preferred embodiment, but it
normally costs more because of the need for a separate
server, although it may be possible to amortize this cost
somewhat if the backup server provides other
application services. Such an embodiment also tends to
concentrate the [...]
files) on the server, which may affect the scalability of
this approach, although there are simple ways to
distribute [...]
nodes if desired, which are readily apparent to those of
ordinary skill in the art. There are also many other
possible embodiments consisting of various flavors of
hybrids of the file server and application server
approaches, which would fall within the scope of the
present invention.

In yet another embodiment, the backup storage
means incorporates hierarchical storage management
(HSM), in which files that have not been accessed for a
long time are migrated from disk to a secondary storage
means, such as tape or optical disk. The main purpose
of HSM is to save on storage costs for very large
storage systems by providing the management tools that
allow the migration to be transparent to the system,
except for the additional delay in accessing some files.
Use of any form of HSM in conjunction with the backup
storage means of the present invention does not
significantly affect any of the concepts discussed here.
However, care must be taken not to impair performance of
the backup and restore operations, since delays
incurred in accessing secondary storage may render
the system much less usable. Indeed, it would be fairly
simple to identify portions of the contents of backup
data and directory files of the present invention which
could be migrated to secondary storage without
adversely affecting backup [...] in
most cases, the data reduction methods of the present
invention are sufficiently powerful to keep disk storage
costs down to an acceptable level even without using
HSM.

1. Backup Process

In the preferred embodiment as shown in FIGURE
1, the nodes to be backed up may be either
workstations 102, desktop personal computers 103, laptop
computers 104, or other servers 105 on the network. All
communication is accomplished by creating or
modifying files over the network 106 on the backup storage
101. As shown in FIGURE 2, each node is assigned two
directories, a user directory and a system directory, on
the backup storage means 101, which is contained in
the disk volumes of the network file server 100. The
node has network write access to its user directory
(e.g., \BACKUP\USERS\USER2, 125 in FIGURE 2),
where it posts backup data. A backup administrator
configures the backup system using administrator software
functions provided as part of the product. A backup
Agent process 108, which runs on a network node 107
selected by the backup administrator, migrates the
posted files to the system directory (e.g.,
\BACKUP\SYSTEM\USER2, 128 in FIGURE 2). This
system directory has network rights assigned to make it
read-only to all nodes (except the Agent 108), so that no
user node can corrupt the migrated backup data,
intentionally or inadvertently. While it would not be strictly
necessary to use two directories, [...]
significantly improved in the shared-file environment of the
preferred embodiment using this approach.

In the preferred embodiment, the Agent 108 has
read-write access to all the directories shown in
FIGURE 2. Each user is given read-only access to all the
directories under \BACKUP\SYSTEM 122, but he has
no access to any of the directories under
\BACKUP\USERS 121, other than his own directory
(e.g., \BACKUP\USERS\USER2, 125 in FIGURE 2), to
which he has read-write access. Limiting access to the
posting directories in this fashion further increases
security, since [...]
user's posted backup files before they are migrated.
However, since in the preferred embodiment all backup
files are encrypted and have checksums that can be
used to detect corruption, it would also be possible (and
probably easier, from the viewpoint of the network
administrator) in an alternate embodiment to give all
users read-write access to all directories under
\BACKUP\USERS 121 without significantly
compromising security, assuming reasonably well-behaved users.
The Agent 108 checks the integrity of each backup file
while migrating it; if any errors are detected, the files
are not migrated, thus maintaining the integrity of the
data on the \BACKUP\SYSTEM directories. Observe
that this general approach of the preferred embodiment,
using network access rights and an Agent 108, results
in much higher levels of data security and integrity than
in a conventional shared-file application, where each
client node typically has full read-write access to the
shared files, which are therefore much more susceptible
to corruption.
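
The Agent's check-then-migrate step can be sketched as below. The file
layout is an assumption made only for this illustration: each posted
file is taken to end with a 32-byte SHA-256 digest of the preceding
payload, which is not the checksum format of the preferred embodiment.
The names `is_intact` and `migrate` are likewise hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

DIGEST_LEN = 32  # assumed layout: payload followed by its SHA-256 digest


def is_intact(path: Path) -> bool:
    """Return True if the trailing digest matches the file's payload."""
    raw = path.read_bytes()
    if len(raw) < DIGEST_LEN:
        return False
    payload, digest = raw[:-DIGEST_LEN], raw[-DIGEST_LEN:]
    return hashlib.sha256(payload).digest() == digest


def migrate(user_dir: Path, system_dir: Path) -> list[str]:
    """Move verified files from the posting directory to the system directory.

    Files that fail the integrity check are left in place and not migrated.
    """
    system_dir.mkdir(parents=True, exist_ok=True)
    migrated = []
    for posted in sorted(user_dir.iterdir()):
        if posted.is_file() and is_intact(posted):
            shutil.move(str(posted), str(system_dir / posted.name))
            migrated.append(posted.name)
    return migrated


if __name__ == "__main__":
    # Small demonstration in a throwaway directory tree.
    root = Path(tempfile.mkdtemp())
    user = root / "BACKUP" / "USERS" / "USER2"
    user.mkdir(parents=True)
    data = b"posted backup set"
    (user / "set1.bak").write_bytes(data + hashlib.sha256(data).digest())
    print(migrate(user, root / "BACKUP" / "SYSTEM" / "USER2"))
```

In a real deployment the read-only protection of the system directory
would come from the network access rights described above; the sketch
shows only the integrity gate applied during migration.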

FIGURE 3 shows the main types of files (as well as
some of their inter-relationships) created as part of the
backup system of the preferred embodiment. During the
backup of a disk volume on a node, the backup process
of the preferred embodiment separates all files on the
source disk volume into four categories: new,
unchanged, updated, and modified. New files are those
which did not exist in the same directory at the time of
the previous backup. Unchanged files existed at the
time of the last backup and have not changed since that
time (e.g., they still have the same time, date, and size).
Updated files are files that had been unchanged for
more than N_u days as of the time of the previous
backup, where N_u is a user-selectable option (typically
in the range of 14-90 days), but which have been
changed since the last backup. All other files are
classified as modified. When the first backup of a given
volume is performed, all files are classified as new. For
each new or updated file, the backup software searches
through a global directory database 145 for a matching
file. The global directory database 145 is created and
maintained by the Agent process 108 in the directory
\BACKUP\SYSTEM\GLOBAL 127. Each time the Agent
108 migrates a backup set from the \BACKUP\USERS
path 121 to the \BACKUP\SYSTEM path 122, it searches
for new and updated files in the backup set and adds
them to the global directory database 145. If a matching
file is found in the database, a reference to the contents
of that file is stored instead of the file data itself, as
described below. Similarly, for unchanged files, only a
reference to the previous file contents is stored.
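
The four-way classification can be expressed compactly in code. The
following is a minimal sketch assuming that a file's identity between
backups is judged by its modification time and size, as in the "same
time, date, and size" example above; the `FileStamp` record and the
`classify` function are hypothetical names, and the parameter
`n_u_days` stands for the user-selectable option N_u.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_DAY = 86400


@dataclass(frozen=True)
class FileStamp:
    mtime: float  # last-modified time, in seconds since the epoch
    size: int     # file size in bytes


def classify(current: FileStamp, previous: Optional[FileStamp],
             prev_backup_time: float, n_u_days: float) -> str:
    """Classify a file as new, unchanged, updated, or modified.

    `previous` is the file's stamp in the previous backup, or None if the
    file did not exist in the same directory at that time.
    """
    if previous is None:
        return "new"
    if current == previous:
        return "unchanged"
    # How long the file had been stable as of the previous backup.
    stable_days = (prev_backup_time - previous.mtime) / SECONDS_PER_DAY
    return "updated" if stable_days > n_u_days else "modified"
```

Under this sketch only files classified as new or updated would go on
to be looked up in the global directory database; unchanged files need
just a reference to their previous contents.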

In order to minimize search time and bandwidth, it
is believed preferable not to conduct a search through
the global directory database 145 for modified files. For
the same reasons, and to minimize the growth of the
database, modified files are not added to the global
directory database 145. Instead, the contents of
modified files are stored in the backup by computing the
differences from the most recent version(s) of the file and
saving either the differences or the new version in its
entirety, whichever is smaller. Differences may be
computed and represented in any manner known to those of
ordinary skill in the art. The updated category, which
can be thought of as a special user-defined subset of
the modified category, serves to identify duplicate files
across users which are updated on an infrequent basis.
One common instance of such files would be the new
version of the executables of a word processor or some
other popular application. Note that setting N_u to zero
eliminates the modified category (i.e., all changed files
are in the updated category), while setting N_u to infinity
eliminates the updated category (i.e., all changed files
are in the modified category).
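
The "differences or full version, whichever is smaller" rule can be
sketched as follows. Since the specification leaves the difference
representation open, this example simply uses Python's `difflib` to
produce copy/data operations; the operation format, the size estimate
in `delta_size`, and all function names are assumptions made for
illustration.

```python
import difflib


def make_delta(old: bytes, new: bytes) -> list[tuple]:
    """Describe `new` as copies from `old` plus literal data."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes old[i1:i2]
        elif j2 > j1:                          # "replace" or "insert"
            ops.append(("data", new[j1:j2]))   # literal new bytes
        # "delete" contributes nothing to the new version
    return ops


def apply_delta(old: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


def delta_size(ops: list[tuple]) -> int:
    """Rough storage estimate (assumed): 9 bytes per copy, 5 + payload per data."""
    return sum(9 if op[0] == "copy" else 5 + len(op[1]) for op in ops)


def encode_modified(old: bytes, new: bytes) -> tuple:
    """Store the delta if it is smaller than the new version, else the full file."""
    ops = make_delta(old, new)
    if delta_size(ops) < len(new):
        return ("delta", ops)
    return ("full", new)
```

A small edit to a large file yields a compact delta, while a file that
shares nothing with its predecessor falls back to being stored in full,
so the backup never grows by more than the size of the new version.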

1.1. Backup Directory Files

The backup process of the preferred embodiment
actually creates two files containing information about
each backup set: a backup directory file (e.g., 143), and
a backup data file (e.g., 144). In an alternate
embodiment, these files could be combined into a single file.
The contents of the backup directory file indicate the
directory structure of the source disk volume, as well as
pointers into the backup data files (e.g., 144, 149, and
backup data files of other users) indicating where the
data for each file is to be found. One key feature of the
present invention is the data reduction achieved by
duplicating pointers to data and directory information
instead of duplicating the information itself, including
referencing duplicate information across users. To
explain the role of the backup directory file(s) in
accomplishing this data reduction in the preferred
embodiment, a description of key portions of the backup
directory file (e.g., 143) for a DOS disk volume is given
in FIGURE 4 in Backus-Naur Form (BNF), which is a
well known formal language technique (for example,
see Niklaus Wirth, Algorithms + Data Structures =
Programs, 1976, pp. 281-291). Before discussing the
contents of FIGURE 4, we will explicitly define the
conventions of our BNF, since there are slight variations
in syntax from one author to the next. Non-terminals are
enclosed in angle brackets (e.g., <fileEntry>). The
symbol [...] indicates a formal definition. Terminals are
indicated as single binary digits (0 or 1) or as hexadecimal
quantities using [...]
0xUUUU for 16-bit words, and 0xUUUUUUUU for 32-bit
dwords. Ranges of terminal values are indicated as two
terminal quantities with two periods in between [...]
"one or the other", while brackets [] indicate an optional
field, and an asterisk (*) indicates one or more
repetitions of the field. Thus, for example, [<externDirItem>]*
indicates zero or more of the non-terminal <externDirItem>.
The double slash // indicates a comment to the end of
the line.

FIGURE 4 defines the format of the directory information in
a backup directory file (e.g., 143). At 200, the
<volumeDirInfo> section of the file is defined to be a series
of <subdirFileList> records 201, followed by a separate list
of <externDirItem> records 220. Each <subdirFileList> record
201 contains the directory entries for the files and
subdirectories in a single directory. In particular, as shown
at 201, each <subdirFileList> record consists of a series of
<fileEntry> and <subdirEntry> records 207, 208 (containing
the directory entries for files and subdirectories,
respectively, found in the associated directory) and is
terminated by an <endOfList> marker 202 (for example, a zero
byte). The <endOfList> marker 202 is followed by
<externCount> 203, which is a variable-length encoded integer
<itemCount> 204, representing the number of <externDirItem>
[illegible] The particular encoding (204, 205, 206) of
<itemCount> used in the preferred embodiment is not
important; many simple alternate encodings would serve
equally well, though it is usually desirable for the encoding
to take advantage of the fact that small counts are much more
common than large ones in order to minimize the average code
size. In an alternate embodiment, the <externDirItem> records
220 associated with each <subdirFileList> 201 could be stored
immediately after the <externCount> field 203 instead of in a
separate section, as shown at 200; the preferred embodiment
keeps these sections separate as a slight optimization,
allowing the Agent 108 to scan through the entire
<externDirItem> list 220 quickly to see which external items
are referenced without the [illegible]

In the preferred embodiment, the directory tree is
represented implicitly by placing the <subdirFileList>
records 201 in a conventional depth-first ordering. In other
words, each time a subdirectory is encountered [illegible]
record 208 for that subdirectory is appended to the current
<subdirFileList> 201, and a token representing that
subdirectory is pushed onto a temporary internal stack. When
processing of the current subdirectory is completed, the
<endOfList> and <externCount> fields 202, 203 are appended as
shown at 201, and processing then continues in the
subdirectory represented by the token which is popped off the
internal stack. If the stack is empty, the entire directory
tree is complete. In an alternate embodiment, another tree
ordering (e.g., breadth-first) could be used to achieve
similar results.
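
The stack-driven ordering described above can be illustrated with a minimal Python sketch. It is not the format of FIGURE 4: the nested-dictionary input, the tuple records, and the function name serialize_tree are assumptions made for illustration, and every <externCount> is simply written as zero.

```python
def serialize_tree(tree):
    """Flatten a directory tree into a list of records, one
    <subdirFileList> per directory, using an explicit stack.

    `tree` is a hypothetical nested dict: a value of None marks a
    file, a dict marks a subdirectory.
    """
    records = []
    stack = [tree]  # tokens for directories still to be processed
    while stack:
        current = stack.pop()
        for name, child in current.items():
            if isinstance(child, dict):
                records.append(("subdirEntry", name))
                stack.append(child)  # process this subdirectory later
            else:
                records.append(("fileEntry", name))
        # Close the <subdirFileList> for the directory just processed.
        records.append(("endOfList",))
        records.append(("externCount", 0))  # no external items in this sketch
    return records


tree = {
    "A.TXT": None,
    "SUB1": {"B.TXT": None, "DEEP": {"C.TXT": None}},
    "SUB2": {"D.TXT": None},
}
records = serialize_tree(tree)
names = [r[1] for r in records if r[0] in ("fileEntry", "subdirEntry")]
```

Because the tree shape is implied purely by record order, a reader can rebuild it by replaying the same push and pop sequence while parsing.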



Each explicit <fileEntry> record 207 is assigned a directory
item number (<dirItemNum>) [illegible] is unique [illegible]
embodiment [illegible] quantity as shown at 223 in FIGURE 4;
this size is sufficient since it allows, for example,
[illegible] per day, each with 10,000 changed files, for a
period of over 50 years before overflow could occur. Of
course, a larger quantity of bits could be used if necessary.
The range of <dirItemNum> values 223 used is defined
elsewhere in the backup directory file (e.g., 143). In order
to save space, each <fileEntry> 207 in the <volumeDirInfo>
record 200 is implicitly assigned the next <dirItemNum> 223
in the range, so that no explicit <dirItemNum> 223 needs to
be stored along with each <fileEntry> 207. During the backup
process, when an unchanged file is found, instead of
duplicating the <fileEntry> 207 for that file, a reference
may be added to the previous <fileEntry> 207 by including
[illegible] <dirItemNum> [illegible] the <externDirItem> 220
list associated with the current directory. In the fairly
common case where multiple unchanged files are found with
consecutive <dirItemNum> values 223, in order to save space
this sequence is indicated by a <manyItems> record 222,
consisting, for example, of a one-bit tag (1), a 31-bit
<dirItemNum> 223, and an <itemCount> record 204 which
[illegible] records 207 referenced. Otherwise, the
<externDirItem> 220 is represented as a <oneItem> 221,
consisting of a one-bit tag (0) to distinguish this field 221
from a <manyItems> field, followed by the 31-bit <dirItemNum>
223 of the referenced <fileEntry> 207. Note that the
<externCount> 203 in the preferred embodiment counts the
number of [illegible] <fileEntry> records 207 referenced
thereby.
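
As a rough illustration of the <oneItem> and <manyItems> forms, the following Python sketch groups referenced <dirItemNum> values into runs. The placement of the tag in the most significant bit of a 32-bit word, the plain integer used in place of the variable-length <itemCount>, and the function names are assumptions, not details taken from FIGURE 4.

```python
TAG_MANY = 0x80000000   # assumed: tag bit is the top bit of the 32-bit word
NUM_MASK = 0x7FFFFFFF   # low 31 bits carry the <dirItemNum>


def encode_extern_items(dir_item_nums):
    """Collapse runs of consecutive item numbers into manyItems records."""
    nums = sorted(dir_item_nums)
    items = []
    i = 0
    while i < len(nums):
        j = i
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        run = j - i + 1
        if run == 1:
            items.append(("oneItem", nums[i] & NUM_MASK))      # tag bit 0
        else:
            items.append(("manyItems", TAG_MANY | nums[i], run))  # tag bit 1
        i = j + 1
    return items


def decode_extern_items(items):
    """Expand the records back into the individual item numbers."""
    nums = []
    for item in items:
        if item[0] == "oneItem":
            nums.append(item[1] & NUM_MASK)
        else:
            base = item[1] & NUM_MASK
            nums.extend(range(base, base + item[2]))
    return nums


encoded = encode_extern_items([5, 6, 7, 10])
```

Here three consecutive references cost one record instead of three, which is the saving the <manyItems> form is meant to provide.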

The <fileEntry> and <subdirEntry> records are of variable
length and consist of several fields, as shown at 207 and 208
in FIGURE 4. These fields are dictated by the attributes of
the underlying file system; for purposes of illustration, the
definitions of FIGURE 4 include attributes required for a DOS
FAT file system, but obvious modifications [illegible] 207 to
allow for different attributes in different file systems
(e.g., Macintosh, OS/2 HPFS, NetWare, etc.). The header of
the directory file (e.g., 143) in the preferred embodiment
[illegible] a field specifying the [illegible] format of the
[illegible] in this backup [illegible]. In the example of
FIGURE 4, the first field is the <fileName> record, defined
at 212, which is a zero-terminated variable length character
string (<asciiz>, as defined at 219), representing the name
of the file. Next comes the <fileAttrib> field 209, which is
a single byte containing attribute bits, such as read-only,
directory, hidden, system, etc. The file modification time
<fileTime> 210 follows; this 32-bit quantity includes both
the time and date when the file was last modified. In more
advanced file systems, several other time values could be
added here, such as last access time, creation time, etc. The
<fileSize> field 211 is a 32-bit quantity representing the
size of the file in bytes. Finally, the <fileID> 214 field
indicates where the file data associated with this file can
be found. As shown at 214, this information includes a
<userIndex> 216 and a <fileIndex> 215. Each user on the
backup system has a unique user number <userIndex> 216, which
is a 16-bit quantity in the preferred embodiment. Similarly,
each file is assigned a unique number, similar to the
<dirItemNum> 223 for directory items; this <fileIndex> 215 is
a 32-bit quantity in the preferred embodiment. The two fields
that make up the <fileID> 214 can be used to locate the
appropriate backup data file (e.g., 148) containing the file
data, as will be discussed below. The <subdirEntry> record
208 consists [illegible] <fileEntry> record 207, except that
the <fileTime> field 210 indicates the directory creation
time. In an alternate embodiment, each <fileEntry> record 207
also contains a <lastVersion> field, which is a <dirItemNum>
223 that directly references the <fileEntry> 207 for the
previous version of the file; this technique provides a
linked list of all unique versions of the file, which could
be reconstructed considerably more slowly by reading and
parsing all backup directory files (including those where the
file was unchanged).
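
A minimal Python sketch of the field layout just described may help fix ideas. The little-endian byte order (suggested only by the DOS setting), the order of <userIndex> and <fileIndex> inside <fileID>, the absence of padding, and the function names are all assumptions rather than details confirmed by FIGURE 4.

```python
import struct

# Assumed fixed part: attrib (8 bits), time (32), size (32),
# userIndex (16), fileIndex (32), little-endian, no padding.
FIXED = "<BIIHI"


def pack_file_entry(name, attrib, ftime, size, user_index, file_index):
    """Build one <fileEntry>: zero-terminated name plus fixed fields."""
    return (name.encode("ascii") + b"\x00"
            + struct.pack(FIXED, attrib, ftime, size, user_index, file_index))


def unpack_file_entry(buf, pos=0):
    """Parse one <fileEntry> at `pos`; return (fields, next position)."""
    end = buf.index(b"\x00", pos)            # locate the string terminator
    name = buf[pos:end].decode("ascii")
    fixed = struct.unpack_from(FIXED, buf, end + 1)
    next_pos = end + 1 + struct.calcsize(FIXED)
    return (name,) + fixed, next_pos


entry = pack_file_entry("README.TXT", 0x20, 0x1A2B3C4D, 1234, 7, 99)
fields, next_pos = unpack_file_entry(entry)
```

Because the name is variable in length, a parser has to walk the entries in order; this is the reason a separate pointer array is useful for direct access.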

In the preferred embodiment, there is no way to reference
unchanged <subdirEntry> records 208 from previous backups. In
other words, the entire tree of directories must be
explicitly represented in each backup directory file (e.g.,
143), although the files within those directories can be
incorporated by references in the <externDirItem> section
220, as discussed above. This somewhat arbitrary decision in
the preferred embodiment was made to simplify the backup and
restore logic slightly at a small cost in the size of some
backup directory files, but in an alternate embodiment it
would be simple to allow referencing unchanged subdirectories
(and entire file/subdirectory trees). The size of the backup
directory file (e.g., 143) is normally a small fraction of
the size of the backup data file (e.g., 144), and the
contribution from the subdirectory entries alone to the size
of the backup directory file is normally not significant, so
this issue appears to be of very minor concern at most.

A somewhat related and possibly greater concern is the fact
that according to the definitions of FIGURE 4, there is no
limit on the number of external backup directory files (e.g.,
148) referenced by a given backup directory file (e.g., 143).
If this number were to grow without bound, when it came time
to perform a restore, the amount of time required to
reconstruct the directory tree could be quite large, even
though all the backup directory files are on disk. In
practice, in the preferred embodiment, there is a limit,
imposed at backup time during the construction of the backup
directory file (e.g., 143), on the number N_D of external
backup directory files (e.g., 148) which can be referenced.
Typically this number is set in the range of N_D = 5-20
files. The result is that the <fileEntry> record 207 for an
unchanged file is explicitly included about every N_D
backups. This produces only a tiny increase in the overall
storage requirements on the backup storage 101, but it
guarantees a reasonable response time during the restore
operation.
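
The text does not say how the limit N_D is enforced. One plausible policy, sketched below in Python purely as an assumption, is to keep references only into the N_D most recent external backup directory files and to write an explicit <fileEntry> for any unchanged file whose entry lives in an older one. The function name and the integer backup identifiers are hypothetical.

```python
def split_unchanged(unchanged, n_d):
    """Decide which unchanged files may be incorporated by reference.

    `unchanged` maps a file name to the id of the backup directory
    file holding its latest explicit <fileEntry>; larger ids are
    assumed to be more recent backups.
    """
    recent = sorted(set(unchanged.values()), reverse=True)[:n_d]
    allowed = set(recent)
    referenced = {n: b for n, b in unchanged.items() if b in allowed}
    explicit = sorted(n for n, b in unchanged.items() if b not in allowed)
    return referenced, explicit


referenced, explicit = split_unchanged({"A": 1, "B": 5, "C": 6, "D": 6}, n_d=2)
```

Under this policy an unchanged file is re-entered explicitly once its entry has aged out of the N_D most recent files, which is consistent with the "about every N_D backups" behaviour described above.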

Note that, in the preferred embodiment, each user's backup
files may only reference <externDirItem> records 220 from his
own [illegible] backups of other users. This decision, which
results in a very minor cost in overall storage requirements,
stems from the desire to maintain privacy of all user
directory information. As we will see below, the contents of
each backup directory file are encrypted so that no other
user can even see the names, sizes, dates, or attributes of
another's files, which might in themselves compromise
privacy, even without [illegible] By contrast, the data
contents of files that are backed up can be shared between
users, with privacy insured via a unique encryption key
protocol discussed below. If the size of the backup
[illegible] issue in an alternate embodiment (e.g., a new
type of file system), techniques similar to those used for
data could be applied to directory entries to save space, if
desired. For the file systems of interest to the preferred
embodiment at this time, however, there appears to be no
compelling need to minimize the size of the backup directory
files [illegible]

[illegible] the BNF definitions in FIGURE 4, FIGURE 5
contains a small example of the format of a <volumeDirInfo>
record 200. The format is that of 8086 assembly language,
which allows for very flexible (if somewhat primitive) output
of variable length fields. A semicolon (;) serves as a
comment to end of line. The following directives are used to
control output:

db = emit 8-bit byte(s)

dw = emit 16-bit word(s)

dd = emit 32-bit dword(s)

Hex constants end in h (e.g., 80000000h). Several fields in
the example are left undefined using the (?) expression; for
example, the file time/dates are unspecified because the
particular times are not of interest for purposes of this
illustration. In general, in order to clarify the usage, each
line is followed by a comment containing the BNF
non-terminal(s) corresponding to that line. The line-by-line
comments refer directly to the BNF of FIGURE 4, which has
been described in detail above. Note that the <externDirItem>
list starting at 366 does not give any indication of the
actual contents of the <fileEntry> records referenced; it is
necessary to read and parse the contents of the separate
referenced backup directory file(s) in order to obtain the
directory entries.

In the preferred embodiment, each backup directory file
(e.g., 143) contains several other sections. These sections
are described briefly here. They generally involve
well-understood techniques that are used in many other backup
products, and are therefore readily understood by those of
ordinary skill in the art. However, it is useful to give a
brief explanation of the contents and purpose of these other
sections to give a broader background context for the present
inventions. Each section is covered by a checksum [illegible]
checks, and many of the sections (including the
<volumeDirInfo> section 200) are compressed using well-known
compression techniques, such as those described in U.S.
Patent 5,016,009, or U.S. Patent Application Serial No.
97/927,343 (filed August 19, 1992, entitled "DATA COMPRESSION
APPARATUS AND METHOD USING MATCHING STRING SEARCHING AND
HUFFMAN ENCODING"), both of which are assigned to the
assignee of the present invention and both of which are
incorporated herein by reference. In addition, each section
(other than the header) is encrypted using a private key
encryption scheme, such as the Data Encryption Standard (DES)
or RSA's well known [illegible] RC4 [illegible] protocol for
this encryption is discussed in detail below. Finally, some
primitive error correction ability is incorporated into each
file by appending a section of overall parity sectors at the
end of the file contents.

Each backup directory file in the preferred embodiment begins
with a header which includes a signature and creation time
stamp, as well as information on file format version, file
size, and pointers that identify the location and size of all
other sections. A "bkupDescription" section contains
descriptive information about the backup operation
[illegible] annotation string, time of the backup, count of
new files and bytes, ranges of new <dirItemNum> and
<fileIndex> records 223, 215 generated, a backup file number,
the <userIndex> 216, and a specification of the source volume
for the backup. A "dirIndexRange" section is a small
variable-length record which identifies the exact set of new
<dirItemNum> values 223 included in the file, from which the
<dirItemNum> 223 assignments are made in <volumeDirInfo> 200,
as discussed previously; normally, there is only a single
contiguous range of values, but it is possible after the
Agent 108 has performed a consolidation operation (discussed
below) for multiple non-contiguous ranges to exist in a
single file. A "dirItemPtr" section contains an array of
pointers into <volumeDirInfo> 200, one pointer for each
<fileEntry> 207. This section is actually redundant and can
be reconstructed by parsing <volumeDirInfo> 200; together
with the "dirIndexRange" section, it serves to speed access
to a <fileEntry> record 207 from a separate backup directory
file via a <dirItemNum> reference 223. Finally, a
"fileDecryptKey" section contains a private encryption key
(e.g., for DES) that is used for decryption of the data file
contents. There is one key in this section for each
<fileEntry> 207 in <volumeDirInfo> 200; in fact, conceptually
this key is part of the <fileEntry> record 207, but it is
placed in a separate section in the preferred embodiment
solely because including it directly in the <fileEntry> 207
would lower the compression ratio of the <volumeDirInfo>
section 200.

There are many equivalent ways to organize the information in
the backup directory file to achieve similar results. The
particular record formats of the preferred embodiment
described here are not intended to limit the scope of the
present invention.

1.2. Backup Data Files

The backup data file (e.g., 144) contains data from the files
included in the backup set. Some of this data may be
represented by references into other backup data files from
previous backups (e.g., 149), either from this user or
another user. Each unique file included in the backup data
file is assigned a <fileIndex> 215, which is a 32-bit number
in the preferred embodiment and which is used to reference
that file. Observe that there is no one-to-one correspondence
between <dirItemNum> 223 and <fileIndex> 215 values. For
example, if user A has an exact copy of a file that has
already been backed up by user B, [illegible] <fileID> 214
(i.e., <fileIndex> 215 and <userIndex> 216) as user B's, but
they will have distinct <dirItemNum> values 223, which are
not shared between users in the preferred embodiment as
discussed previously. Most of the data in a backup data file
is compressed; the data from each file included in the backup
set is encrypted using a file-specific key (stored in the
"fileDecryptKey" section of a backup directory file, such as
143) instead of a private user key; in other words, multiple
encryption keys are generally used in each backup data file.
The key management protocol used to guarantee data privacy
will be explained in detail below, but the net result is
that, in the preferred embodiment, the contents of each
backup data file are effectively publicly available, by
contrast to the contents of the backup directory file, which
are encrypted with a private, user-specific key.

A high-level block diagram of the layout of a backup data
file (e.g., 144) of the preferred embodiment is shown in
FIGURE 6. The file consists of four main sections, of which
the Data Blocks section 161 is typically by far the largest
since it contains the actual contents of the files included
in the backup set. Because of its size, the Data Blocks
section 161 comes directly after the fixed-size Header 160 in
the preferred embodiment so that file data can be written
directly into the backup data file without ever having to
move the data blocks again. The Header section 160 contains a
signature and creation time stamp, as well as information on
file format version, file size, and a pointer 162 to the
FileInfoPtrs section 178. Like the backup directory file, the
backup data file may also contain parity sectors to allow
simple error correction in the case where small disk flaws
develop on sectors on the backup storage means 101. The
FileInfoPtrs section 178 contains a variable size record
indicating the exact set of <fileIndex> values 215
represented in the backup data file; this record is very
analogous to the "dirIndexRange" section of the backup
directory file discussed above, and typically consists of
only a single contiguous range of values. The rest of the
FileInfoPtrs section 178 contains an array of fixed size
entries (e.g., 181), one entry per <fileIndex> value 215
represented in the range. Each entry contains a pointer
(e.g., 179) into the FileInfo section 175, where there is one
variable-sized entry (e.g., 176) per file. In addition, each
entry (e.g., 181) in the FileInfoPtrs section contains other
file-specific information (such as file size and a CRC over
the initial blocks of the file contents) required to enter
each new or updated file into the global directory database
145. Each FileInfo entry (e.g., 176) contains information on
the contents of the file, including a variable length array
of pointers (e.g., 173) into the DataBlock section 161 or
into the contents of files contained in other backup data
files.

For example, FIGURE 6 illustrates some details of two files
contained in the backup set. The FileInfoPtr entry 181 for
File A includes a pointer 179 to the FileInfo entry 176 for
File A. This entry 176 contains a set of pointers 173,
including pointers 164 and 166 to data blocks 163 and 165,
respectively, in the DataBlocks section 161, as well as
pointer(s) including 167 to data blocks in other backup data
files. All data blocks (including 163 and 165) in this backup
data file associated with File A are encrypted using the
encryption key for File A, as shown at 168; this key is
stored in the "fileDecryptKey" section of the backup
directory file(s) containing a <fileEntry> whose <fileID> 214
references File A. Similarly, the FileInfoPtr entry 182 for
File B includes a pointer 180 to the FileInfo entry 177 for
File B. This entry 177 contains a set of pointers 174,
including pointer 170 to data block 171 in the DataBlock
section 161, as well as pointer(s) including 172 to data
blocks in other backup data files. All data blocks (including
171) in this backup data file associated with File B are
encrypted using the encryption key for File B, as shown at
169.

Given a <fileID> 214 and the decryption key, it is relatively
straightforward to extract the file contents to "restore" a
file. First a search is performed through the backup data
files of the user identified by <userIndex> 216 for the
backup data file containing the <fileIndex> 215 of interest.
This search can be easily performed, because the Header 160
and FileInfoPtrs 178, which contains the range of <fileIndex>
values 215, are not encrypted. In the preferred embodiment
the search can normally be performed even more quickly
because the Agent 108, as part of the migration process of
backup data files from \BACKUP\USERS 121 to \BACKUP\SYSTEM
122, builds a special Index Range Lookup file (e.g., 151 of
FIGURE 3). This file, which is redundant in the sense that it
can always be re-built from the contents of the backup data
and directory files, includes a table which maps index ranges
into backup data file names and which is arranged for a fast
binary search. With the appropriate backup data file
identified, this file is opened, and the pointer 162 to the
FileInfoPtrs section is read from the Header 160. The index
range record of the FileInfoPtrs section 178 is then scanned
to identify which pointer corresponds to the given
<fileIndex> 215; that pointer (e.g., 179) is then used to
index the FileInfo entry (e.g., 176) for the file of
interest. From the FileInfo entry, pointers (e.g., 173) are
found to the data blocks corresponding to each portion of the
file of interest, which may reside either in this backup data
file (e.g., 163) or an "external" backup data file (e.g.,
149). These blocks are then read, decrypted, and decompressed
to provide the original file contents. [illegible] easily
that accessing any portion of the file contents requires only
a handful of disk accesses. Although the number of
[illegible] than would be necessary to access a file on a
"normal" file system, it is still small enough that the
access time during restore is measured in milliseconds (or
tenths of seconds at worst), not the tens of seconds or the
minutes normally associated with restore operations from
conventional tape backup [illegible] "access" time, the
restore software may include an intelligent caching algorithm
for the contents of the backup directory and backup data
files.
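
The first step of the restore path, mapping a <fileIndex> to the backup data file that holds it, can be sketched in Python as a binary search over a sorted range table. The in-memory table layout, the file names, and the function name are hypothetical; the actual format of the Index Range Lookup file is not given in this section.

```python
import bisect


def find_data_file(lookup, file_index):
    """Return the backup data file name covering `file_index`, or None.

    `lookup` is a list of (index_base, index_count, file_name) tuples
    sorted by index_base, with non-overlapping ranges.
    """
    bases = [base for base, _, _ in lookup]
    pos = bisect.bisect_right(bases, file_index) - 1  # last base <= index
    if pos < 0:
        return None
    base, count, name = lookup[pos]
    return name if file_index < base + count else None


# Hypothetical table contents for illustration only.
lookup = [(100, 50, "A.DAT"), (150, 10, "B.DAT"), (400, 5, "C.DAT")]
hit = find_data_file(lookup, 120)
```

Each lookup costs a logarithmic number of comparisons, so the table can be consulted without opening any backup data file until the right one is known.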

Given this high-level understanding of the various sections
of a backup data file, consider FIGURE 7, which contains a
set of [illegible] considerably more detail on the format of
some of these sections. At 400, the entire file
<bkupDataFile> is defined to consist of the four main
sections discussed above (<header> section 401, <dataBlock>
section 405, <fileInfo> section 408, and <finfoPtrs> section
432); the remainder of FIGURE 7 describes the contents of
these sections.

The relevant [illegible] listed at 401. In particular, the
<finfoPtrOffset> field is defined at 402 as a pointer (32
bits in the preferred embodiment) to the start of the
<finfoPtrs> section 432. The <indexRangeCnt> field is defined
at 403 as a count of the number of <indexRange> entries 433
in the <finfoPtrs> section 432.

At 405 and 405, each <dataBlock> is defined to be a
variable-length array of 8-bit bytes. In the preferred
embodiment each <dataBlock> 405 starts at an offset in the
backup data file (e.g., 144) which is a multiple of 4, so
that the offset can be encoded in 30 bits. This convention
allows slightly tighter packing of the <seekPoint> fields 416
at a very minor cost in the overall size of the backup data
file, but this optimization is in no way critical to the
invention. Each data block may be compressed (indicated by
<packFlag> 422 of the associated <dataBlockPtr> 419) and then
encrypted. The encryption key for each <dataBlock> 405 is not
kept in the backup data file (e.g., 144); as discussed
previously, the keys reside in encrypted form in the backup
directory file(s) which reference the associated file data
blocks. Typically the <dataBlock> section 405 is by far the
largest section in the backup data file. As part of the
encryption process, a checksum is appended to each
<dataBlock> 405 in order to facilitate a quick check for
corruption, either of the block itself or of the pointer to
it.

Definition of the <finfoPtrs> section begins at 432. In
particular, this section consists of two variable-length
arrays of <indexRange> 433 and <finfoData> records 436. As
discussed previously, the number of <indexRange> records 433
is indicated by the <indexRangeCnt> field 403 of the <header>
401. Given the <indexRangeCnt> value 403 and the size of each
<indexRange> record 433 (8 bytes in the preferred
embodiment), the location of the first <finfoData> record 436
is easily deduced. Normally a backup data file has only one
<indexRange> 433 (i.e., a single contiguous range of file
indices), but it is possible after the Agent 108 has
performed a consolidation operation (discussed below) for
multiple non-contiguous ranges to exist in a single file.
Each <indexRange> record 433 consists of two fields: an
<indexBase> 434 and an <indexCount> 435, each of which is a
32-bit value in the preferred embodiment. The <indexBase>
value 434 indicates the first file index in the range. The
<indexCount> value 435 indicates the number of file indices
in the range. The sum of the <indexCount> values 435 over all
<indexRange> records 433 indicates the number of <finfoData>
records 436 in the file. In the preferred embodiment the file
index associated with each <finfoData> record 436 is
implicitly assigned sequentially from the ordered set of file
index values generated by the <indexRange> record(s).
In the preferred embodiment, each <finfoData> record 436 is
of fixed size, consisting of four 32-bit fields, as shown at
436 - 440. In particular, the <fileInfoPtr> value 437 points
to the associated variable-length <fileInfo> record 408. The
<fileSize> value 438 indicates the size of the associated
file in bytes. The <dirInfoCRC> value 439 is a hash value (a
CRC in the preferred embodiment) computed over a portion of
the directory entry for the associated file; use of this
fixed-size value instead of a variable-length directory entry
simplifies the search for matching files between users. The
<partialFileCRC> 440 is a hash value (a CRC in the preferred
embodiment) computed over the first portion of the file. In
the preferred embodiment it covers up to the first N_P = 256K
bytes of the file (which is all of the file in most cases).
When searching for matching files across users, the backup
application loads N_P bytes of the file into memory and
computes a hash value (<partialFileCRC> 440); then performs a
preliminary search through the global database (e.g., 145)
for matching records 408. If a match is found, then a more
complete match can be verified using the full <fileCRC> field
409, although there is usually no need to perform this
further check since most files are smaller than N_P bytes.
Using this partial-file hash technique generally allows a
single-pass search for files that are too large to fit into
memory, instead of having to read the entire file once to
compute the <fileCRC> 409 and then a second time to back up
the file contents if there is no match.
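
The single-pass idea can be sketched in Python as follows. The CRC-32 of zlib stands in for whatever CRC the preferred embodiment uses, which is not specified here, and the function name and chunk size are assumptions.

```python
import io
import zlib

N_P = 256 * 1024  # bytes covered by the partial-file hash (from the text)


def crc_single_pass(stream, n_p=N_P, chunk_size=64 * 1024):
    """Read a file once, returning (partial CRC, full CRC, size).

    The partial CRC is available as soon as the first n_p bytes are
    read, so a database lookup could be made at that point; the full
    CRC is then extended incrementally over the remaining data.
    """
    head = stream.read(n_p)
    partial_crc = zlib.crc32(head)
    full_crc = partial_crc
    size = len(head)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        full_crc = zlib.crc32(chunk, full_crc)  # continue the running CRC
        size += len(chunk)
    return partial_crc, full_crc, size


data = bytes(range(256)) * 2048          # 512K bytes of sample data
partial, full, size = crc_single_pass(io.BytesIO(data))
```

For a file no larger than N_P the two values coincide, which matches the remark that the further check is usually unnecessary.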

There is one variable-sized <fileInfo> record 408 for each
file included in the backup data file. The <fileCRC> value
409 is a hash over the entire contents of the file; in the
preferred embodiment, a CRC is used. The <bitFields> record
410 contains several small bit fields indicating various
attributes of the <fileInfo> record. For example, the
<refCnt> field 411 is a two-bit field in the preferred
embodiment indicating how many external files are
"referenced" in reconstructing the contents of the file, and
can take on the values 0 (no external files), 1 (one external
file), or 2, while the value 3 is not allowed in the
preferred embodiment. This particular limitation is imposed
only to optimize the encoding of the <dataPtr> field 419; in
theory, there is no reason why more external files could not
[illegible] in practice it is very rare for more than one
external file to be referenced: the previous file version.
The value of the <refCnt> field 411 indicates the number of
<fileRef> records 426 that are included in the <fileInfo>
record 408. The <refLevel> field 412 is a six-bit field in
the preferred embodiment and is [illegible] value 412 for any
external file [illegible] in a <fileRef> record 426;
[illegible] the <refLevel> value 412 counts the maximum
levels of "indirection" required to access any portion of the
file contents; this value is limited in the preferred
embodiment to a user-settable parameter N_L (typically in the
range 5-10) in order to set an acceptable bound on access
time to the contents of the file at restore time. Whenever
the <refLevel> value 412 would exceed the N_L value if a
particular external file were referenced, the data from the
associated block is duplicated instead of being incorporated
by reference. The <isGlobal> bit 413 indicates whether the
given file should be entered into the global directory
database 145; it is 1 for new and updated files and 0 for all
other files.
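
The bound on indirection can be illustrated with a short Python sketch. The exact rule for computing <refLevel> is not legible above, so the sketch assumes that a file with no external references has level 0 and that any other file has a level one greater than the largest level among the files it references. The function names are hypothetical.

```python
def ref_level(external_levels):
    """Assumed rule: 0 with no external files, else 1 + the maximum."""
    return 1 + max(external_levels) if external_levels else 0


def may_reference(current_external_levels, candidate_level, n_l):
    """True if adding a reference to a file of `candidate_level`
    keeps this file's level within the limit n_l; otherwise the
    block data would be duplicated instead of referenced."""
    levels = list(current_external_levels) + [candidate_level]
    return ref_level(levels) <= n_l


ok = may_reference([], 4, n_l=5)
too_deep = may_reference([], 5, n_l=5)
```

Under this assumption a chain of successive file versions is cut every N_L versions, after which a self-contained copy of the data is written.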

The <seekPts> record 414 contains a count <seekPtCount> 415
(32 bits in the preferred embodiment) of the number of
<seekPoint> records 416 in the <seekPts> record. Each
<seekPoint> record 416 consists of the starting
<logicalOffset> value 417 associated with the <seekPoint>
416, followed by a pointer <dataPtr> 418 to the associated
data. The <seekPoint> array 416 is saved in sorted order
based on the <logicalOffset> values 417, allowing a quick
binary search to find the <seekPoint> 416 for any particular
logical offset in the file. The number of bytes of the file
"covered" by each <seekPoint> 416 is easily calculated by
subtracting its <logicalOffset> value 417 from the
<logicalOffset> value 417 of the succeeding <seekPoint> 416
(or from the <fileSize> field 438 for the last <seekPoint>
record 416). There is no firm limit in the preferred
embodiment for the minimum number of bytes covered by a
<seekPoint> 416, but typically the blocks are fairly large
([illegible]K bytes or more), although this may decrease (or
increase) as portions of external files are referenced.
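
The binary search over seek points, and the subtraction that gives the number of bytes each one covers, can be sketched in Python. The list of offsets, the requirement that the first offset be 0, and the function name are assumptions made for illustration.

```python
import bisect


def find_seek_point(offsets, file_size, logical_offset):
    """Return (index of covering seek point, bytes it covers).

    `offsets` holds the sorted <logicalOffset> values of a file's
    seek points; the first is assumed to be 0.
    """
    if not 0 <= logical_offset < file_size:
        raise ValueError("logical offset outside the file")
    i = bisect.bisect_right(offsets, logical_offset) - 1
    # The last seek point extends to the end of the file.
    end = offsets[i + 1] if i + 1 < len(offsets) else file_size
    return i, end - offsets[i]


offsets = [0, 8192, 20000]
index, covered = find_seek_point(offsets, 25000, 8192)
```

No length field needs to be stored per seek point, since the next offset (or the file size) always supplies it.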
The <dataPtr> field 418 can take one of two forms: either a <dataBlockPtr> 419 reference to a <dataBlock> 405 in this backup data file, or an <externPtr> 420 to an external file. In the preferred embodiment, these two fields each consist of 32 bits and are distinguished by the value of a single type bit in the <dword>, as shown in 419 and 420. If the <dataPtr> field 418 is a <dataBlockPtr> 419 (as determined by the type bit being 0 as shown at 419), the <packFlag> bit indicates whether or not the associated <dataBlock> 405 is compressed, and the <blockOffs> field 421, which comprises the remaining 30 bits of the <dataPtr> 418 in the preferred embodiment, points to a <dataBlock> 405 in this backup data file. As discussed previously, each <dataBlock> 405 starts on a 4-byte boundary in the preferred embodiment, so that the 30 bits is sufficient to represent any <dataBlock> offset in the file. If the <dataPtr> 418 is an <externPtr> (as determined by the type bit being 1 as shown at 420), the <refFileNo> bit 424 indicates which file is being referenced (only two files can be referenced in the preferred embodiment), and the <relOffs> value 423 is a signed relative logical offset from the <logicalOffset> 417 of this <seekPoint> 416, indicating the absolute logical offset in the external referenced file where the data associated with this <seekPoint> 416 can be found. Notice that accessing such an external block given this logical offset requires parsing the <fileInfo> record 408 and <seekPoint> records 416 of the referenced file in another backup data file, which may in turn reference yet another external file; hence the limitation N_L on the number of reference levels. In the preferred embodiment, the <relOffs> field 423 is only 30 bits, so a referenced external block must start within +/- 512 Mbytes of the given <logicalOffset> 417, which is not a limitation in practical terms, although this restriction could easily be removed by extending the size of the <dataPtr> field 418 when dealing with extremely large files.
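As an illustration of the two <dataPtr> forms, the following Python sketch decodes a 32-bit pointer. The text fixes the field widths (one type bit, one flag bit, 30 remaining bits) but not their positions, so the layout used here (type bit in bit 31, flag in bit 30, value in the low 30 bits) and the function names are assumptions.

```python
def decode_data_ptr(word: int) -> dict:
    """Decode a 32-bit <dataPtr> under the assumed bit layout."""
    type_bit = (word >> 31) & 1
    flag = (word >> 30) & 1
    low30 = word & 0x3FFFFFFF
    if type_bit == 0:
        # <dataBlockPtr>: blocks start on 4-byte boundaries, so the byte
        # offset in the backup data file is four times <blockOffs>.
        return {"kind": "dataBlockPtr", "packed": bool(flag),
                "file_offset": 4 * low30}
    # <externPtr>: sign-extend the 30-bit <relOffs> (two's complement assumed).
    rel = low30 - (1 << 30) if low30 & (1 << 29) else low30
    return {"kind": "externPtr", "ref_file": flag, "rel_offs": rel}

def encode_extern_ptr(ref_file: int, rel_offs: int) -> int:
    """Build an <externPtr>; a 30-bit signed field spans +/- 512 Mbytes."""
    if not -(1 << 29) <= rel_offs < (1 << 29):
        raise ValueError("relOffs out of range")
    return (1 << 31) | (ref_file << 30) | (rel_offs & 0x3FFFFFFF)

print(decode_data_ptr((1 << 30) | 128))
# {'kind': 'dataBlockPtr', 'packed': True, 'file_offset': 512}
```

The range check in `encode_extern_ptr` reflects the +/- 512 Mbyte limit stated in the text, since 2^29 bytes is 512 Mbytes.
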

The optional <fileRef> records 426 indicate which external file(s) are referenced by the <externPtr> fields 420 of the <seekPoint> records 416; these <fileRef> records 426 are encrypted with the same encryption key used for the <dataBlock> records 405 for this file. The <fileID> record of the <fileRef> 426 is identical in format to the <fileID> record 214 used in the backup directory file, containing the <fileIndex> 215 and <userIndex> 216 that identify the particular file being referenced. The <decryptKey> record 427, which consists of 64 bits in the preferred embodiment, contains the private encryption key used for the referenced file. This key is also contained in the backup directory file(s) which contain <fileID> records indicating this referenced file, but the key is duplicated here because it may only be otherwise available from a backup directory file of another user, which is encrypted with that user's personal encryption key. Hence, although the key is included here, it is encrypted to restrict access to only those users who have legitimate access to this file, as discussed below, so as not to compromise the privacy of the referenced file.

FIGURE 8 gives a detailed example of the <seekPts> record 414 for a hypothetical file X. The <seekPtCount> of this record is 5, as shown at 450. Thus, there are five <seekPoint> records, 451-455, each of which is broken up into its <logicalOffset> field (e.g., 456) and its <dataPtr> field (e.g., 457, 458, 459). The first <seekPoint> record 451 has a starting logical offset of 0 as shown at 456, and this <seekPoint> record covers the first 8192 bytes (0-8191) of file X, since the second <seekPoint> 452 starts with logical offset 8192. These 8192 bytes associated with the first <seekPoint> record 451 are found in a <dataBlock> within this backup data file, as is indicated by the type bit 0 at 458 which identifies the <dataPtr> record of 451 as a <dataBlockPtr>. The <blockOffs> field 457 of the first <seekPoint> record 451 contains the value 128, indicating that the associated <dataBlock> is to be found at offset 512 (i.e., 4*128) in this backup data file, and the 1 bit in the <packFlag> field 459 indicates that this <dataBlock> is compressed. Similarly, the second <seekPoint> record 452 covers the bytes 8192-11999 of file X, but these 3808 bytes are to be found starting at logical offset 8492 of the external file indicated by the first <fileRef> record (ref file #0) in this <fileInfo> record. The offset 8492 is computed by adding the <logicalOffset> value of the second <seekPoint> record 452 (i.e., 8192) to the <relOffs> value of the <dataPtr> record of the second <seekPoint> record 452, which is an <externPtr> as indicated by the 1 type bit in the <dataPtr>; the <refFileNo> bit indicates which external file is referenced (0 in this case). The third <seekPoint> record 453 covers bytes 12000-16383 of file X and indicates an uncompressed data block starting at offset 4080 of this backup data file. The fourth <seekPoint> record 454 covers bytes 16384-23008 of file X, and these 6625 bytes are to be found in reference file #1 at logical offset 15384 (<logicalOffset> + <relOffs> = 16384 - 1000 = 15384); note that <relOffs> is a negative number in this case. The fifth (and last) <seekPoint> record 455 covers all the remaining bytes of file X; for example, if the <fileSize> is 30000, this block consists of the 6991 bytes 23009-29999. These bytes are found in a compressed <dataBlock> at offset 8472 of this backup file. This example shows how simple it is to interpret the <seekPts> structure, and it is obvious that a binary search on the <logicalOffset> field can be used to locate any section(s) of the file very quickly.
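The arithmetic of the FIGURE 8 example can be checked with a short Python sketch. The `SeekPoint` class and `resolve` function are hypothetical names introduced for illustration. The <relOffs> of 300 for the second record and the <blockOffs> values 1020 and 2118 are not stated in the text; they are derived here from the stated offsets (8492 - 8192, 4080 / 4, and 8472 / 4).

```python
from dataclasses import dataclass

@dataclass
class SeekPoint:
    logical_offset: int
    external: bool   # type bit: False = <dataBlockPtr>, True = <externPtr>
    flag: int        # <packFlag> for local blocks, <refFileNo> for external ones
    value: int       # <blockOffs> (4-byte units) or signed <relOffs>

def resolve(points, file_size):
    """List where each covered byte range of the file is actually stored."""
    rows = []
    for i, p in enumerate(points):
        end = points[i + 1].logical_offset if i + 1 < len(points) else file_size
        length = end - p.logical_offset
        if p.external:
            # Absolute logical offset in the referenced external file.
            rows.append(("extern", p.flag, p.logical_offset + p.value, length))
        else:
            # Byte offset of the <dataBlock> in this backup data file.
            rows.append(("local", bool(p.flag), 4 * p.value, length))
    return rows

file_x = [
    SeekPoint(0,     False, 1, 128),     # record 451
    SeekPoint(8192,  True,  0, 300),     # record 452 (relOffs derived)
    SeekPoint(12000, False, 0, 1020),    # record 453 (blockOffs derived)
    SeekPoint(16384, True,  1, -1000),   # record 454
    SeekPoint(23009, False, 1, 2118),    # record 455 (blockOffs derived)
]
for row in resolve(file_x, 30000):
    print(row)
```

Each printed row gives the storage kind, the flag (compressed or reference file number), the resolved offset, and the number of bytes covered.
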

The <fprints> section 428 of the <fileInfo> record 408 contains hash functions or "fingerprints" computed over fixed-size portions ("chunks") of the file contents. The purpose of these fingerprints is to allow efficient probabilistic searching of matching chunks between file versions without having to fully examine the contents of the previous file version. The idea of using fingerprint functions in this fashion was first conceived by Karp & Rabin [Karp, Richard M., and Michael O. Rabin, "Efficient Randomized Pattern-Matching Algorithms", Harvard University Center for Research in Computing Technology, TR-31-81, December 1981]. Fingerprints are particularly effective when performing backup of modified files over a low-speed communications link when the backup data files are at the remote site, as discussed in a subsequent section. In the preferred embodiment, although there is no absolute need to use the fingerprints (since the previous file contents can be explicitly produced for chunk matching), the fingerprints are stored in the backup data file anyway to facilitate such bandwidth optimizations; in particular, even over local area networks it may be desirable to minimize network traffic when backing up large files with only small modifications. The size of the chunk used for fingerprinting, which may vary from file to file, is indicated by <fpChunkSize> 429 and is typically in the range of 256 to 8192 bytes; a value of 0 for <fpChunkSize> 429 indicates that no fingerprints are stored. Like the <decryptKey> record(s) 427, in the preferred embodiment, the <fprints> record 428 is encrypted using the associated encryption key from the "fileDecryptKey" section of the backup directory file.
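A minimal Python sketch of chunk fingerprinting with a sliding window follows. It uses a Rabin-Karp style polynomial hash as a stand-in, not the 72-bit CRC fingerprints of the preferred embodiment; the constants `BASE` and `MOD` and both function names are illustrative assumptions.

```python
BASE = 257
MOD = (1 << 61) - 1   # a large prime modulus; an illustrative choice

def chunk_fingerprints(old: bytes, chunk: int) -> dict:
    """Map fingerprint -> offset for each full chunk of the previous version."""
    table = {}
    for off in range(0, len(old) - chunk + 1, chunk):
        h = 0
        for b in old[off:off + chunk]:
            h = (h * BASE + b) % MOD
        table.setdefault(h, off)
    return table

def sliding_matches(new: bytes, old_prints: dict, chunk: int):
    """Yield (new_offset, old_offset) for each window of `new` whose
    fingerprint equals the fingerprint of a chunk of the old version."""
    if len(new) < chunk:
        return
    top = pow(BASE, chunk - 1, MOD)   # weight of the oldest byte in the window
    h = 0
    for b in new[:chunk]:
        h = (h * BASE + b) % MOD
    pos = 0
    while True:
        if h in old_prints:
            yield pos, old_prints[h]
        if pos + chunk >= len(new):
            break
        # Slide one byte: drop the oldest byte, shift, add the newest byte.
        h = ((h - new[pos] * top) * BASE + new[pos + chunk]) % MOD
        pos += 1

old = b"abcdefghijkl"
new = b"ZZ" + old                     # two bytes inserted at the front
prints = chunk_fingerprints(old, 4)
print(list(sliding_matches(new, prints, 4)))   # [(2, 0), (6, 4), (10, 8)]
```

Because each slide is a constant-time update, the whole new file is scanned in time proportional to its length, and all three old chunks are found even though the insertion shifted them off chunk boundaries.
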

The basic idea behind fingerprinting, as described in detail by Karp & Rabin, is to choose a hash function which is easy to "slide" over a chunk of data. In other words, as the chunk starting location is moved from one position in the file to the next, the "oldest" byte exits the chunk "window", the intermediate bytes shift over one location, and a new one enters the window. Karp & Rabin describe several types of linear fingerprint functions which are easy to update given the current fingerprint value and the oldest and newest bytes. For example, a modulo 256 sum is a particularly simple case (too simple to be useful in practice), but CRCs and other similar functions are quite acceptable. Given the set of fingerprints for the chunks of the previous file contents, the fingerprint function is computed by sliding over chunks of the current file contents, checking for a match with any of the previous file fingerprint values at each byte location. When a match is found, that chunk in the current file is assumed to match the chunk associated with the fingerprint value in the previous file. The fingerprint function can be chosen to be large enough (72 bits in the preferred embodiment) so that the probability of false match is smaller than the probability of storage medium failure (typically 10^-15), so that no further validation is necessary. Alternately, this sliding fingerprint mechanism can be used solely as a search technique to identify areas of probable matches and then fully validate them by extracting the old file contents and performing a complete compare. It is also possible to use fingerprints only in a non-sliding fashion; this approach works particularly well for very large (e.g., database) files where records tend not to move, while for smaller files, where the bandwidth consumption is not as much of an issue, a full compare could be performed in this case. In an alternate embodiment, a global database could be built of chunk fingerprints instead of entire files, allowing matching of portions of files across users; but the expected gain in storage space from such a scheme does not appear to be worth the extra overhead required.

In the preferred embodiment each <fingerprint> record 430 consists of nine bytes (72 bits) of fingerprint function (CRCs), plus the first three bytes of the associated chunk, for a total of twelve bytes. Using these extra three bytes allows the fingerprints to be computed and compared on a sliding dword-by-dword basis instead of a byte-by-byte basis, which speeds up the computation considerably. However, other than speed, the net result is the same as a byte-by-byte sliding fingerprint comparison. There is one <fingerprint> record 430 per chunk of the file; however, in order to save disk space, no <fingerprint> records 430 are included for chunks which are entirely contained in external file references (via <externPtr> records 420) with identical <fpChunkSize> 429 and which are on chunk boundaries in the referenced file, since the fingerprints for those chunks are already contained in the <fileInfo> 408 for the referenced file.

There are many possible variations on the particular layout of records in the backup data file of the preferred embodiment. For example, in an alternate embodiment each <fileInfo> record 408 could be placed directly after the set of <dataBlock> records 405 associated with the file instead of in a separate section; for example, this change might be represented by simply changing definition 400 to read:

<backupDataFile> ::= <header> [<dataBlock>* <fileInfo>]* <fileInfoPtrs>

Similarly, some fields, such as <fileCRC> 409 and <isGlobal> 413, could be moved from <fileInfo> 408 to <fileInfoPtr> 432, or vice-versa. In some file systems (e.g., Windows NT NTFS), 64-bit file pointers would be used instead of the 32-bit pointers of the preferred embodiment. It would also be simple to modify the format slightly to allow for more reference files or more reference levels. Such changes do not affect the basic idea, and the particular record formats of the preferred embodiment described here are not intended to limit the scope of the present invention.

1.3 Global Directory Database File


With a knowledge of the information contained in the backup data and directory files and of how the information is used to represent file data contents, the technique for searching for matching files can easily be explained. In designing the global database, it was assumed that there could be millions (or tens of millions) of new/updated files entered into the database. For example, a survey of ninety user workstations at Stac (the assignee of the present invention) revealed a total of about 250,000 unique files across all the disks, and the preferred embodiment is designed to handle systems with at least that many backup nodes. Thus, it is important to minimize the network bandwidth consumed by the search process, which might easily dwarf the file data traffic during backup unless great care is taken in the database design. In particular, several conventional database approaches (e.g., B-Tree) were considered and rejected in light of this concern. While there may be other types of database architectures that work well, the structure of the database of the preferred embodiment is particularly efficient for the type of searches required here.

During the backup process, each node may have thousands of new/updated files that need to be searched against the global database. Generally there will be considerably fewer such files once the initial backup is completed, but the worst case must be handled. By contrast, there may be millions of files already entered into the database. Thus, it seems initially that a client/server embodiment with a backup server, in which the client sends its (relatively small) list of new/updated files to the server, which in turn does the matching against the large global database, should have a significant advantage in network bandwidth usage over a shared-file system. However, the overhead of performing the search in the shared-file environment of the preferred embodiment is optimized to the point that this drawback is not significant in practice.

When searching for matching files across users, it is usually deemed sufficient to have matching file size, file name, time/date, and hash value (e.g., CRC) computed over the file contents. While this approach does involve a finite (though minute) probability of false match, the error probability is acceptably small for almost all practical applications. In an optional user-invoked "exhaustive compare" mode, this probabilistic type of match serves only to initiate a complete byte-by-byte comparison of the contents of the two files; however, the overhead of this mode is large enough, particularly in light of the practically negligible improvement in the level of certainty obtained thereby, that invoking such "skeptical" behavior is best done infrequently, if at all. In alternate embodiments, the match criteria can be further loosened not to require a matching file name or time/date; for example, two files 'REPORT.DOC' and 'REPORT.BAK' might be judged to be matches if all other parameters are equal. There are many variations on this theme; for instance, perhaps just the file names, excluding extensions, are compared, or perhaps only the first few (e.g., 4-6) characters of the file name are compared in an attempt to include minor file renaming changes, such as 'REPORT' to 'REPORT1'. In general, however, the file size (or at least some number of least significant bits thereof) and a hash value on the file contents are required to be equal in order for a file in the database to be considered a match for a new/updated file being backed up. In the preferred embodiment, in order to work around the "problem" of the varying size of file names (and other attributes) in formatting the global database entries, a 32-bit hash (actually a CRC, <dirInfoCRC> 439) over the relevant directory entry information (e.g., file name, time, date, and size) is used for comparison instead of the full directory entry. In addition, a hash value (a 32-bit CRC in the preferred embodiment) over the file contents is compared, as well as the least significant 16 bits of the file size. If all of these values match, the file being backed up is considered to be a match to the file in the database, resulting in a false match probability of less than 2^-80 (10^-24). Clearly, the amount of matching required can be tailored to the specific error probability acceptable for any given environment (e.g., by increasing the size of the CRCs), and such changes would still fall within the scope of the present invention.

It is useful to note at this point that these various levels of matching files across users in the global database are all more rigorous in general than the level of effort used to identify unchanged files from the previous backup of the same user. In the preferred embodiment, as is quite common in backup applications, the default behavior is to consider a file unchanged if its file size, time, date, and name are unmodified from the previous backup. As discussed above, it is always possible to perform, at the user's option, either an exhaustive comparison of the contents of the apparently unchanged files or a comparison of [illegible in source], but the improvement in certainty level is rarely considered to be worth the extra effort and overhead.

Given the <dirInfoCRC> 439, <fileSize> 438, and <fileCRC> 409 values for a particular file to be backed up, a search through the global database of the preferred embodiment is performed in an attempt to find a matching entry. As noted previously, the <partialFileCRC> 440 value is actually used initially instead of the full <fileCRC> 409 as an optimization, since the former covers the entire file in most cases; for large files the <fileCRC> 409 is then verified in the preferred embodiment by looking into the backup data file containing the <fileInfo> 408 for that file. Each global database entry contains the <dirInfoCRC> 439 (four bytes), <fileSize> 438 (actually only the least significant sixteen bits in the preferred embodiment), and <partialFileCRC> 440 (four bytes) values for the associated file, which are extracted from the backup data file containing the <fileInfoData> record 436 for that file. In addition, each entry contains the <fileID> record 214 (six bytes), which can be used to locate the actual file data contents. The total size for an uncompressed database entry is thus fixed at 16 bytes in the preferred embodiment. If there are N = one million files in the database, downloading an entire global database from the backup storage means 101 in order to perform the search would require a download of 16 Mbytes of data, if no effort were made to minimize this overhead. While this is considerably less than what would be required for a database full of complete directory entries (e.g., with entire file names), it is still too large for an environment where dozens or hundreds of nodes on the network may be performing backup. In an alternate embodiment, the complete <fileSize>, <fileCRC> and other fields could also be stored in each global database entry, slightly reducing both the probability of false match and the search time, at a small cost in the size of the global database file 145, but such improvements are minor at best in practical terms.

To minimize the data transfer overhead and the search time associated with the global directory database 145, it is organized into two levels as shown in FIGURE 9, taking advantage of the effective randomization of search values due to the nature of a CRC function, as used in <dirInfoCRC> 439 and <partialFileCRC> 440. Each entry of the first level 500, which is actually represented in two structures 502 and 505, contains only a subset of the bits of the <dirInfoCRC> 439 and <partialFileCRC> 440 fields. Each entry in the second level 501 contains the remaining bits needed to constitute the entire global database entry (e.g., 508, 509, 510, and 511), and each entry also includes a 16-bit CRC over the second-level entry to allow for a corruption check. The number of bits included in the first level 500 is fixed in each global database file, although the actual number will increase in general as the number of database entries grows. The first level 500 is downloaded to the node from the backup storage means 101 by the backup program, and its contents (502 and 505) are used as a quick filter to limit inquiries into the (much larger) second level 501 to only those entries which have a very high probability of being a match. In the preferred embodiment, to minimize download time, all entries in the first level 500 are packed at a bit level and are unpacked after downloading, while entries in the second level are byte-aligned for simplicity.
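The 16-byte uncompressed global database entry and the figures derived from it can be sketched in Python as follows. The field widths come from the text; the field order, the little-endian byte order, and the names `ENTRY` and `pack_entry` are assumptions made for illustration.

```python
import struct

# <dirInfoCRC> (4 bytes), low 16 bits of <fileSize> (2 bytes),
# <partialFileCRC> (4 bytes), <fileID> (6 bytes); "<" disables padding.
ENTRY = struct.Struct("<IHI6s")

def pack_entry(dir_info_crc: int, file_size: int,
               partial_file_crc: int, file_id: bytes) -> bytes:
    """Pack one uncompressed entry; only 16 bits of the file size are kept."""
    return ENTRY.pack(dir_info_crc, file_size & 0xFFFF, partial_file_crc, file_id)

entry = pack_entry(0x12345678, 30000, 0x9ABCDEF0, bytes(6))
print(ENTRY.size)                 # 16
print(ENTRY.size * 1_000_000)     # 16000000 bytes for N = one million entries

# 32 + 16 + 32 = 80 compared bits, so a random false match has
# probability 2**-80, which is below the 10**-24 quoted in the text.
print(2.0 ** -80 < 1e-24)         # True
```

The full-download figure of 16 million bytes per node is what motivates splitting the database into a small downloadable first level and a larger second level that is only probed on demand.
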

The entries in both levels are stored in the same order, sorted by the value of <dirInfoCRC> 439, so that given the index of an entry (e.g., 512) in the first level 500, the position of the corresponding entry (e.g., 513) in the second level 501 can be easily computed. In other words, the ith entry in the first level 500 corresponds directly to the ith entry in the second level 501. The first level entries are actually stored in a compressed form to save download time, using a counts array 502 and partial entry array 505. This simple compression is achieved by noting that, since the entries are sorted by the value of <dirInfoCRC> 439, the leading (most significant) bits of consecutive <dirInfoCRC> values 439 will tend to be equal. Thus, instead of storing these leading bits for each entry, a counts array table 502 of M = 2^(m_0) entries is included, where the value of m_0 is chosen by the Agent 108 as discussed below. The jth array entry, n_j, contains the number of consecutive <dirInfoCRC> entries 439 with the leading m_0 bits having a value of j, as shown at 504. For example, in FIGURE 9, n_0 is 4, covering the first four entries in the tables 505 and 501, for which the leading m_0 bits of <dirInfoCRC> 439 are 0; similarly, n_1 is 3, covering the next three database entries, for which the leading m_0 bits of <dirInfoCRC> 439, interpreted as an integer, are 1. When the database is created, the Agent 108 chooses the value of m_0 based on the total number of global directory entries (N) in the file; a typical value is m_0 = 16 for N larger than 64K. Note that N = sum of n_j, where the sum is over all values of j = 0 .. M-1. Since the values of <dirInfoCRC> 439 in the file are effectively randomly distributed, the n_j values have a distribution with mean N/M and a fairly small range. Thus, to minimize storage space further in the preferred embodiment, instead of storing the actual values n_j, these values are represented in the counts array 502 by n_j - n_min, where n_min is the minimum over all n_j values. Each count can then be represented in s = ceil(log2(n_max - n_min + 1)) bits, where n_max is the maximum over all n_j values. The values n_min and s are computed by the Agent 108 when the global database file is created and are stored in the header of the global database file. In an alternate embodiment, it may be possible to reduce the size of the counts array 502 even further using a Huffman or arithmetic code, but such gains would be minor because the counts array 502 constitutes only a small part of the size of the first level 500.

A concrete example is the easiest way to clarify this simple encoding. Suppose that we have a total of N = one million database entries. If we choose m_0 = 16, then M = 64K, and the average value in the counts array 502 is N/M, or about 16. Suppose that we have n_min = 2 and n_max = 30. Then s = 5 bits, so each count entry n_j is represented in five (packed) bits by the value n_j - 2, for a total of 40K bytes. Without using a count array, each database entry in 505 would have contained all m_0 = 16 leading bits of the <dirInfoCRC> value 439, for a total of nearly two megabytes (1953K bytes); so using the count array in this case saves a total of nearly 1913K bytes in the size of the first level 500.

It has been observed that even for N/M as large as 1024, which corresponds to 64 million files in the database if m_0 = 16, well over 99.9% of all count array distributions can be represented by s = 8 bits or less. In practice, the Agent 108 tries various values of m_0 to minimize the size of the first level 500, although it appears empirically that the amount of savings is not terribly sensitive to the choice of m_0 as long as it is close to the m_0 value that produces the minimum size. In other words, simply using m_0 = 16 appears to work fairly well in most cases of interest.
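The counts-array encoding can be sketched in Python as follows. All function names are hypothetical, and the small data set is made up for illustration: only the first two counts (4 and 3) follow FIGURE 9, and m_0 = 2 is used to keep the example short.

```python
def encode_counts(sorted_crcs, m0):
    """Return (stored, n_min, s): per-prefix counts minus n_min, and the
    number of bits s needed to hold each stored count."""
    counts = [0] * (1 << m0)
    for crc in sorted_crcs:
        counts[crc >> (32 - m0)] += 1      # leading m0 bits of a 32-bit CRC
    n_min, n_max = min(counts), max(counts)
    # bit_length() of the range equals ceil(log2(n_max - n_min + 1)).
    s = (n_max - n_min).bit_length()
    return [c - n_min for c in counts], n_min, s

def prefix_of_entry(stored, n_min, index):
    """Recover the leading m0 bits of the entry at position `index`."""
    seen = 0
    for prefix, c in enumerate(stored):
        seen += c + n_min
        if index < seen:
            return prefix
    raise IndexError(index)

def counts_array_bytes(m0, s):
    """Size of the bit-packed counts array in bytes."""
    return (1 << m0) * s // 8

crcs = [0x00000001, 0x00000002, 0x10000000, 0x3FFFFFFF,   # prefix 0: 4 entries
        0x40000000, 0x50000000, 0x7FFFFFFF,               # prefix 1: 3 entries
        0x80000000, 0xBFFFFFFF,                           # prefix 2: 2 entries
        0xC0000000, 0xD0000000, 0xFFFFFFFF]               # prefix 3: 3 entries
stored, n_min, s = encode_counts(crcs, 2)
print(stored, n_min, s)                    # [2, 1, 0, 1] 2 2
print(prefix_of_entry(stored, n_min, 4))   # 1

# Figures from the concrete example: m_0 = 16, s = 5, N = one million.
print(counts_array_bytes(16, 5) // 1024)   # 40   (K bytes with counts array)
print(1_000_000 * 16 // 8 // 1024)         # 1953 (K bytes without it)
```

The last two lines reproduce the 40K and 1953K byte figures, whose difference is the 1913K byte saving quoted in the text.
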

With the counts array 502 used to represent the first m_0 bits of the <dirInfoCRC> value 439 very efficiently, the remainder of the first level 500 consists of an array 505 of N entries, packed at a bit level. Each entry contains x bits of the <dirInfoCRC> value 439 (beyond the most significant m_0 bits) and y bits of the <partialFileCRC> 440; the values x and y are chosen by the Agent 108 (and stored in the global database file header) based on m_0 and the total number of entries N in the global database 145. Since the entire first level 500 is downloaded, the idea is to trade off the size of the array 505 to minimize the number of accesses required into the second level 501 to validate a match. For example, using N and M from the above example, if we choose x = 10 bits and y = 0 bits, then the table 505 consists of a total of about 1220K bytes (one million entries at 10 bits each); the entire first level 500 consists of about 1260K bytes (1220K + 40K), as opposed to the 16M bytes required for a complete download of the entire database. Since we have m_0 + x = 26 bits of <dirInfoCRC> 439 thus represented by the first level 500, the average probability p_f of a false second-level match based on a first-level match is then given roughly by p_f = N/2^26, or about 1/64, assuming (as we are) a random distribution of <dirInfoCRC> values 439 in the database. In other words, when filtering database inquiries at the first level, about 63 of 64 inquiries that match at the first level will result in matches at the second level also, in this example. Since every inquiry into the second level 501 involves a disk access into an entry (e.g., 513) containing the remaining fields of the global database entry, it is important to minimize spurious accesses. Typically a value of p_f in the range 1/16 to 1/256 gives a reasonable tradeoff between search performance and download size. For example, if we increase x to 11 bits in this example, we decrease to p_f = 1/128 at a cost of about 125K bytes in the size of the first level. Although y = 0 in this example, the y bits of <partialFileCRC> 440 may be needed to keep p_f in range as N becomes very large, or in the unusual case where many files with the same name/time/date/size (i.e., <dirInfoCRC> 439) exist with different file contents (and thus <partialFileCRC> values 440). The Agent 108 determines all these parameters at database creation time based on the statistics of the entries in the database. In the preferred embodiment, m_0 + x is always at least 16, meaning that the first-level entries contain at least the 16 most significant bits of the <dirInfoCRC> value 439, so that only the least significant 16 bits of <dirInfoCRC> 439 need to be kept in the second level 501 at 508.

At the beginning of the backup process, the backup software of the preferred embodiment loads into memory the first level 500 of the global directory database file 145, either from the backup storage means 101 or, to minimize download time, from a cached copy in a directory on a disk local to the node. For each new/updated file to be backed up, a search is performed through the first level database entries to see if there is a match. If no match is found at the first level (the "no match" case), there is no matching file anywhere in the database, so the backup proceeds to copy the file contents into the backup data file, which may involve computing differences from the previous file version in the case of an updated file. If a match is found at the first level, the corresponding second-level entry (or entries) is retrieved and compared; if no match is found here, the backup proceeds as in the "no match" case just discussed. The position of the corresponding second-level entry is easily determined, as discussed above, because its ordinal location in the second level is the same as the ordinal location of the associated first-level entry. If a match is found at the second level, further inquiry into the backup data file containing the <fileInfo> and <fileInfoData> records 408, 436 associated with the file may be necessary in some cases, depending on the size of the file (e.g., <fileCRC> 409 may be needed for large files) and whether the user has enabled the "exhaustive compare" mode, but in most cases a match to the global directory entry at the second level is sufficient to indicate a file match. If it is ultimately determined that a complete match has occurred, the <fileID> 214 included in the <fileEntry> record 207 of the backup directory file for this backup is set to indicate the matching file in the global database, so no file data needs to be saved in the backup data file for this backup, and there is no new <fileIndex> 215 assigned, nor a <fileInfo> section 408 added.

The particular first-level search mechanism used in the preferred embodiment is very simple, and there are many other well known search techniques that could be used. The key point here is that, after the first-level data is downloaded it is all available locally at the node, so there is never a need to access the remote backup storage means 101 to identify first-level matches. In the preferred embodiment it is assumed that the entire first-level data can fit into main memory during the search process; were this not the case, a virtualized (disk-based) search could be designed, using well known algorithms, that might be considerably slower but would still achieve the same result. The preferred embodiment builds two arrays in memory, as shown in FIGURE 10. The main array 526 has N entries, each containing the x + m_0 bits (527) of <dirInfoCRC> 439 and y bits (528) of <partialFileCRC> 440 from the first-level entry, sorted in the same order as in the first level of the global database file. In other words, the array 526 is effectively a memory image of the contents of 505. The pointer array 520 consists of T = 2^(m_1) entries, where m_1 is a number of bits chosen based on the total number of global database entries N and the amount of memory available in order to optimize the search process; note that m_1 may or may not be equal to m_0. Each entry in the pointer array 520 contains a pointer P_k into the main array 526. The index into the pointer array 520 is computed by extracting the most significant m_1 bits of the <dirInfoCRC> value 439 for the file in question. P_0 521 points to the first entry in 526, and P_1 points to the first entry in 526 (which in this example is [illegible in source]) for which the m_1 bits of <dirInfoCRC> 439 in question have the integer value 1. Similarly, P_k 523 points to the first entry in 526 for which the bits of <dirInfoCRC> 439 in question have the integer value k. The count of entries in 526 to be searched for each index k is easily obtained from the difference between successive pointer entries, P_(k+1) - P_k; an extra "dummy" entry P_T 525, which points just past the end of the main array, is appended at the end of the pointer array 520 in the preferred embodiment so that the same count computation can be performed for the last entry P_(T-1) 524, without requiring any special case logic.

In the preferred embodiment entries to be added to the global directory database file 145 are extracted from the backup data files (e.g., 144) by the Agent process 108 as part of the migration of the backup data files from the \BACKUP\USER path (e.g., 125) to the \BACKUP\SYSTEM path (e.g., 129). The Agent 108 first verifies the CRC covering the <fileInfoData> entries 436 in the backup data file to guarantee that no corrupted entries are added to the global directory database. A new global database file 145 may then be created, consisting of the old entries merged with the new entries. In the preferred embodiment, the new database file is initially created by the Agent 108 under a temporary name so that backup processes may continue to use the current database file. Once creation of the new file is completed, its name is changed to a valid global directory database file name which will then be accessed by subsequent backup operations. In the preferred embodiment the name of global directory database files have the form GDnnnnnn.GDD, where
nnnnnn is a number which is incremented each time a    time. In an alternate embodiment, the updates could
new global directory database file is added. For exam-    instead be incremental in nature so that all update files
ple, the first file would be GD000001.GDD, the second    would have to be downloaded, or both incremental and
would be GD000002.GDD, etc. Only a small number    differential updates could be stored, giving more optimi-
(typically 1-4) of the most recent versions of such files    zation flexibility to the backup software local cache
is retained; older versions are deleted once they are no    logic. Once the update list reaches a certain size (e.g.,
longer in use. Thus, for example, after some time there    10 percent of the size of the global database) or a cer-
might be two files, GD000138.GDD and    tain number of update files (e.g., 500) have been added,
GD000139.GDD stored in the \BACKUP\SYS-    the Agent 108 rebuilds an entirely new global database
TEM\GLOBAL directory 127; each time a backup oper-    file 145 containing all entries in the main and update
ation begins, the backup process will select the "latest"    database file(s). These particular settings governing
version of the global directory database file 145 availa-    how often a new global database file is built are control-

de (GP0OO139^GDD in this example). - ; , - led by the backup system admi nistrator on the Agent 

This method of completely rewriting the database node 107. Even after building a new global directory 

allows the optimized search structure discussed above is database, the Agent 108 may leave the old one(s) 

to be maintained; as.opposed to a cxxweriWl data- ^>und for a. while and may even continue adding, 

base design (e.g.. using B-Tree structures to add new updates to the old file(s), so that the local cache logic of 

entries). Fortunately, the "batch" mode of operation the. backup software may optimize its download strat-v 

inherent in backup makes such an approach acceptable *QV ^ general, only infrequently does the backup soft- 

in this application. However, once the backup system so ware heed to download to level 500 of an entire 

has been in use for a while, me rujmber d ariditional global directory database file 14S from the backup stor- 

eritries to the global database for each new backup age nreans 101. thu startup time for 

often b«i^ each backup operation, 
base size; particularly since only new and updated files 

are added to the database. For example, there might be 2$ 1 .4. Other Backup Files 
one million entries in the global database, but a hew 

backup process might add only a few dozen new FIGURE 3 shows several file types other than those 

entries. In this case, rewriting the entire global database discussed above. Most of these files are either redun- 

can be an extremely slow process, and downloading the dant (i.e.. can be regenerated from other f3es) or are 

new database after each backup could also be slow. To so ancillary at best to the present invention. A brief 

minimize such overhead, in the preferred embodiment description of the contents and uses of these foes is 

the Agent process 108 may post "update" directory files given here for completeness;. 

147 in the \BACKUP\SYSTEMVGLOBAL directory 127. As discussed previously, an index Range Lookup 

These update files 147, which are basically identical in ffle (e.g., 151) is built and maintained for each user by 

structure to the main global database 145, contain only as the Agent process 108. This file is constructed from the 

the new entries to be added to the global database. contents of the migrated backup data and backup direc- 

Since some of these update files may be quite small, tory files (e.g., 148, 149). It includes a table indicating 

the Agent 108 may choose to store them in a simplified the directory/He index ranges of each backup directory 

format with nr>o a 0. x« 16, and y a 0, so that there is no and backup data file, respectively. This file is thus 

count table 502. 40 entirely redundant arid can be thought of as a table of 

In the preferred embodiment, each update direc- contents for the backup directory/data files. Its contents 

tory, file is given a file name which links it to the associ- are organized to allow a quick binary search to deter- 

ated "base" global database file; the naming convention mine while fHe contains a given directory/file index for 

is GUxxxnnn.GDU, where nhn is the last three digits of the user, instead of having to open each file in turn to 

the base global database file name, and m is the 45 perform such a search. This file is not encrypted, 

update number. For example, the file GU003138.GDU The Backup Log file (e.g., 150) is also a redundant 

would be the third update to the base file f He, buirt and maintained by the Agent 108 for each user. 

GD000138.GDD. Since only.a few global database files It contains a copy of the "bkup Description" section of 

are retained at any time, the three digits nnn are always each of the user's backup directory files. This log file is 

sufficient in the preferred embodiment to identify the so typically used at restore time to present a list of availa- 

associated global database file unambiguously. ble backups to the user, including the annotation string 

The backup software usually maintains a simple provided by the user when the backup occurred. With- 

cache on a local disk of the last global/update directory out this file, the restore software would have to open 

file(s) downloaded from the backup storage means 101, many backup directory files to present such a list, which 

so it can speed up the database first-level download 55 could be quite slow. The contents of this file are 

process. In the preferred embodiment, each update file encrypted using the same encryption key applied to the 

contains all the updates (i.e., a differential update) to its backup directory files. 

associated main database file, so the backup process The User Account Database file 146 is maintained 

only has to download the most recent update file at any by the backup administrator software. It contains the 
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account records for all authorized backup users. In par-,: 
ticular, ct contains the list of user names (e.g., JOHN); 
user dir^ory names (e.g. , USER2), ^ ef{D > values, as .. 
well as encryption and password keys for each user, as 
will be discussed in a later section: Most of the record 
associated with each user in this file is encrypted using 
the user s private password. 
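
By way of illustration only, the binary search over the Index
Range Lookup table described above might be sketched in Python
as follows. The record layout (first index, last index, file
name), the sample file names, and the function name
find_file_for_index are assumptions made for this sketch; the
specification does not give the on-disk layout of file 151.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class IndexRange(NamedTuple):
    first_index: int   # lowest file index held in this backup file
    last_index: int    # highest file index held in this backup file
    file_name: str     # hypothetical backup data file name


def find_file_for_index(table: list[IndexRange], index: int) -> Optional[str]:
    """Binary-search a table sorted by first_index for the file holding `index`."""
    starts = [entry.first_index for entry in table]
    pos = bisect_right(starts, index) - 1   # last range starting at or before index
    if pos >= 0 and table[pos].first_index <= index <= table[pos].last_index:
        return table[pos].file_name
    return None                             # index falls in a gap or outside the table


# Sample table, sorted by first_index (values are made up).
table = [
    IndexRange(0, 99, "data_0001"),
    IndexRange(100, 249, "data_0002"),
    IndexRange(300, 420, "data_0003"),
]
```

With such a table held in memory, locating the file for a given
index costs a handful of comparisons instead of opening each
backup file in turn.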

The Password Log file (e.g., 140) is used to perform
changes of the user password. This operation will be
discussed in more detail below, but this file basically
allows each user to post a password change "request"
to the Agent 108, which will in turn update the user's
password and encryption keys in the User Account
Database and re-encrypt the user's backup directory
files (e.g., 148).

The Previous Dir file (e.g., 141) contains the directory
information from the last backup operation. Its contents
are redundant and can be regenerated from the backup
directory files (e.g., 148). However, unlike the backup
directory files, the Previous Dir file is not encrypted
with a key requiring a user password. Thus, a backup
operation can proceed at a pre-scheduled time (e.g.,
midnight) without requiring the user to type in his
password. In the (hopefully rare) event that this file is
lost or corrupted, it can be reconstructed, but only after
the user enters a password.

The User Preferences file (e.g., 142) contains user 
selected preferences, such as the values of settable 
parameters (e.g.. % No), the specification for which 
files are to be excluded from the backup, etc. 

It should be noted that all of these files in the system
can easily be backed up to tape using any commercially
available tape backup package. Because most of these
files are read-only, there is little opportunity for
user-induced data corruption unless network security is
breached. Thus, tape backup is relegated to a role of
catastrophic failure recovery in almost all cases.
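
As a recap of the global directory file naming convention
described earlier (GDnnnnnn.GDD for base files and
GUxxxnnn.GDU for update files), the selection of the latest
base file and of its most recent differential update might be
sketched in Python as follows. The function names and the use
of regular expressions are assumptions made for this sketch
only; the specification defines the name formats but not how
the backup software parses them.

```python
import re
from typing import Optional

BASE_RE = re.compile(r"^GD(\d{6})\.GDD$", re.IGNORECASE)
UPDATE_RE = re.compile(r"^GU(\d{3})(\d{3})\.GDU$", re.IGNORECASE)


def latest_base(names: list[str]) -> Optional[str]:
    """Return the GDnnnnnn.GDD name with the highest sequence number."""
    bases = [(int(m.group(1)), n) for n in names if (m := BASE_RE.match(n))]
    return max(bases)[1] if bases else None


def latest_update_for(base: str, names: list[str]) -> Optional[str]:
    """Return the highest-numbered GUxxxnnn.GDU whose nnn matches the base file."""
    m = BASE_RE.match(base)
    if m is None:
        raise ValueError(f"not a base database name: {base}")
    suffix = m.group(1)[-3:]   # last three digits of the base file number
    updates = [
        (int(u.group(1)), n)
        for n in names
        if (u := UPDATE_RE.match(n)) and u.group(2) == suffix
    ]
    return max(updates)[1] if updates else None


# Example directory listing (file names follow the formats in the text).
listing = [
    "GD000138.GDD", "GD000139.GDD",
    "GU001138.GDU", "GU003138.GDU", "GU002139.GDU",
    "README.TXT",
]
base = latest_base(listing)
update = latest_update_for(base, listing)
```

Because each update file is differential in the preferred
embodiment, only the single highest-numbered update for the
chosen base file needs to be fetched.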

1.5. Remote Backup

For a mobile (e.g., notebook computer) or remote 
(e.g., home office) user, backup is often very problem- 
atic. The normal difficulties of enforcing a backup disci- 
pline are magnified, both because it is usually 
undesirable to buy or to carry around a backup device 
and because connections to a network, when available, 
are often very low speed (e.g., modem). Yet the data on 
a remote computer may be as critical as the data on any 
network node, so backup is equally important. The 
present invention provides a fairly simple but very
effective solution to this problem in many cases.

The basic idea is to use backup over a low speed link to
the network, relying on the duplicate file identification
methods of the preferred embodiment to eliminate the need
to send duplicate files over the link and on the <fprints>
records 428 to identify differences between file versions
so that only file changes are sent. Typically, it is
desirable if possible to perform the initial backup when
the remote computer (e.g., 104) is connected directly to
the network 106 with a fairly high speed link. Otherwise,
the initial download of the first level 500 of the global
directory file 145 and the sending of the user's unique
files that typically will not change in the future will
make the initial backup quite slow. However, in the case
where a high speed connection is not possible, the initial
backup can generally still benefit considerably from the
duplicate file identification, although it may require
several hours. Typically, subsequent backups can be
performed remotely in a matter of minutes. The caching
strategies discussed throughout this specification are
clearly critical to performance in this case. In addition,
it is helpful to cache the <fprints> sections 428 from the
previous backup on a local disk to speed up the
differencing operation further, although this is not
essential to the preferred embodiment.

Remote restore operations will be slower than local
access, but the time required to restore a few small files
is quite acceptable in general. In most cases, a full
restore over a remote low speed link is not recommended,
because the duplicate file identification is of no help in
reducing the download time in the preferred embodiment.

2. Privacy 

As has been alluded to previously, a crucial (though
somewhat subtle) privacy issue arises due to the ability
to identify and reference duplicate files across users. As
a simple example, suppose that user #1 has files A, B,
and C that are already saved in a backup data file and
entered into the global directory database, and that user
#2 has files X, B, and Z. When user #2 performs a
backup, the backup software will detect the presence of
the duplicate file (B) using the techniques discussed
above and insert into the backup directory file a <fileID>
record 214 referencing user #1's file B. This is all fine;
however, notice that in order to find the duplicate file B,
user #2 effectively has access to all of user #1's data
files, even files A and C, which may be files that user #1
wishes to keep private. Even assuming that the backup
software is properly designed not to support user #2
directly in accessing these non-duplicate files, in the
absence of any of the prevention measures discussed
in this section, a clever hacker could (with a significant
reverse engineering effort) gain access to the contents
of all of user #1's files, which are present on the backup
data storage means 101 in the backup data files. This
type of access must not be allowed in any product, such
as the preferred embodiment, that hopes to reassure
customers that their private data will remain private.
This need for privacy is not just related to personal data
kept on a user's workstation; it also often involves
critical corporate information such as the status of
certain business negotiations, employee salaries, other
personnel data, etc.

2.1 Keeping private files private

The present invention includes a simple and novel
technique that uses encryption to restrict access to a
user's data on a file-by-file basis; in particular, only
those users who in fact have (or once had) a valid copy
of a file may reference that particular file. The data of
each file in the backup set is stored in an encrypted form
in the backup data file, where the encryption key is
based on a fingerprint (e.g., CRC) of the file's data itself.
These "encryption" fingerprints themselves are then
stored in the "fileDecryptKey" section of the backup
directory file, which is itself encrypted with a key that is
accessible to the user. In addition, as discussed
previously, the reference records in a backup data file
also contain the <decryptKey> record 427 for a referenced
file, but these records are also encrypted, using the
encryption fingerprint of the referencing file, to prevent
"indirect" access. Thus, users can only successfully
decrypt a file's data if they have the correct encryption
fingerprint, presumably obtained by computing the
fingerprint over their own copy of the file. In this way, a
user has access only to encrypted versions of the private
files of other users, while at the same time having the
ability to decrypt files which are common (and thus not
private). In the preferred embodiment each 64-bit
encryption fingerprint is algebraically independent of the
<fingerPrint> values 430 and is the combination of a CRC
and a simple non-linear checksum function over the first
256K bytes of the file contents. One interesting property
of this scheme is that if the "original" owner of a file
forgets his password, he will effectively be denied access
to his files, while other users can continue to get access
to the files they share.
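
The content-derived key scheme described above might be
sketched in Python as follows. This is an illustration under
stated assumptions, not the implementation of the preferred
embodiment: the particular non-linear checksum, the packing of
the 64-bit value, and the SHA-256 based XOR keystream (a toy
stand-in for a real cipher, not suitable for protecting data)
are all choices made for this sketch only.

```python
import hashlib
import zlib

FINGERPRINT_SPAN = 256 * 1024   # only the first 256K bytes feed the key


def encryption_fingerprint(data: bytes) -> bytes:
    """Hypothetical 64-bit key: CRC-32 plus a simple non-linear checksum."""
    head = data[:FINGERPRINT_SPAN]
    crc = zlib.crc32(head) & 0xFFFFFFFF
    acc = 0
    for byte in head:
        # Illustrative mixing step; the specification does not give the function.
        acc = (acc * 31 + byte * byte + 7) & 0xFFFFFFFF
    return crc.to_bytes(4, "big") + acc.to_bytes(4, "big")


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key plus a counter (stand-in cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt; XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


# User #1 backs up file B; the stored form is encrypted under B's own fingerprint.
file_b = b"quarterly report template " * 10
stored = xor_crypt(encryption_fingerprint(file_b), file_b)

# User #2 holds an identical copy, so he derives the same key and can decrypt.
recovered = xor_crypt(encryption_fingerprint(file_b), stored)

# A user without a copy of B can only derive keys from other files.
wrong = xor_crypt(encryption_fingerprint(b"some other file"), stored)
```

Note that, as in the text, the key depends only on the first
256K bytes, so two files that differ only beyond that point
would share a key in this sketch.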

It is also true that a user may wish to keep private
the mere existence of a certain file or a name of a file.
Thus, in the preferred embodiment, the <volumeDirInfo>
section 200 of each backup directory file (e.g., 143) is
also encrypted with the same user-specific key as is
used for the "fileDecryptKey" section. Enough
information is stored in the backup data file to allow a
separate user to gain access to the (encrypted) data
portion of a given file without knowing the name of the
file, so uninvited users are not able to "peruse" the
directory/file tree of another user without knowing his
password.

2.2 "Back Door" Data Access Policy

The need for data privacy must be balanced in the
corporate environment with the right of the company to
retain access to its intellectual property (e.g., computer
files) in the case of an employee who is unwilling or
unable to produce his password. Such cases could easily
arise when an employee forgets his password, becomes
disgruntled, or is the victim of a disabling or fatal
accident. Typically, the current version of the user's data
would be available directly from the workstation disk,
but there are clearly scenarios where it is critical for the
corporation to access the user's backup data sets.

It was initially thought that the system should allow the
user alone to set his password, without any administrative
"back door" to the backup set data. However, it was
decided that such an approach does not give the
corporation the ability to recover its data in any of the
"disaster" scenarios mentioned above. Further, it quickly
became apparent that there were certain operations, such
as changing passwords and consolidating backup sets,
which became very difficult or impossible for the Agent
process 108 to perform in such an environment. Thus, the
preferred embodiment is designed to maintain very high
security and privacy between users, but the backup
administrator does have the ability to access a user's
backup data if necessary.

For a user who insists on maintaining ultimate privacy
of certain personal files, there are several options,
although some (or all) of these may be unacceptable to
his manager. First, he may opt not to use the backup
software of the present invention. Second, he may encrypt
the files in question on his local disk using a separate
encryption utility. Third, he may exclude the files in
question from the backup set. In the preferred embodiment
the administrator has access to each User Preference file
(e.g., 142), which contains the exclude/include list, so
that an audit may be conducted by management to determine
which files are not being backed up.

2.3 Encryption Key Protocols 

In any system using encryption, careful attention 
must be paid to how the encryption keys are handled, 
and the present invention is no exception to this rule. 
This section discusses the generation and use of 
encryption keys in the preferred embodiment in order to 
insure privacy. 

As shown in FIGURE 11, when a new user account
is added by the administrator to the backup system of
the preferred embodiment, the administrator software
generates a user-specific random unique encryption
key (userDirKey 541) that will be used to encrypt the
user's backup directory files. As we have seen, the
"fileDecryptKey" section of these directory files contains the
keys 543 (generated from fingerprint functions on the 
file contents) used to encrypt the file data 542 in the 
backup data file. The userDirKey 541 is placed in the 
User Account Database file 146, where it is encrypted 
according to a user-supplied password 540. This pass- 
word 540 may be initially supplied by the user to the 
administrator, or it may be chosen by the administrator 
and given to the user (normally with instructions to 
change it upon first use). 

The administrator software of the preferred embodiment
also stores the user password 540 and the userDirKey 541
in a separate section of the User Account Database file
146 which is encrypted using an administrator password.
Actually, what is stored is the encryption key (a message
digest in the preferred embodiment) generated from the
password, not the password itself. Thus, the administrator
has a "back door" path to decrypt the user's directory
files if necessary. In addition, the administrator may
configure the Agent process 108 to change the userDirKey
value 541 from time to time and re-encrypt all user
directory files to guard against the possibility that a
hacker has somehow obtained access to the user's password
540 and/or userDirKey 541. Although such a change may
require some time to complete, in the preferred embodiment
the backup directory files for a user remain "on line"
during this operation. This is actually accomplished by
storing two userDirKey values (current and old) in the
account entry for each user in the User Account Database
file 146. If a decryption checksum fails using the current
userDirKey value, the backup software of the preferred
embodiment automatically tries the old userDirKey value
instead. Thus, the Agent 108 first sets the old userDirKey
value to be the current userDirKey value, then sets the
new current userDirKey and finally proceeds to re-encrypt
all the backup directory files. At any time during the
re-encryption process, one of the two userDirKey values
will therefore decrypt any given backup directory file.
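
The two-key rotation just described might be sketched in
Python as follows. The Account class, the seal/unseal helpers,
the CRC-32 integrity check, and the SHA-256 based XOR keystream
(a toy stand-in for a real cipher) are all assumptions made for
this sketch; only the ordering of the steps and the fallback
from the current key to the old key are taken from the text.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Optional


def _xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (stand-in only): XOR with a SHA-256 keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Append a CRC-32 of the plaintext, then encrypt both together."""
    return _xor(key, plaintext + zlib.crc32(plaintext).to_bytes(4, "big"))


def unseal(key: bytes, blob: bytes) -> Optional[bytes]:
    """Decrypt; return the plaintext, or None if the embedded CRC does not match."""
    raw = _xor(key, blob)
    body, crc = raw[:-4], raw[-4:]
    return body if zlib.crc32(body).to_bytes(4, "big") == crc else None


@dataclass
class Account:
    current_key: bytes   # current userDirKey value
    old_key: bytes       # previous userDirKey value


def read_directory(account: Account, blob: bytes) -> bytes:
    """Try the current key first, then fall back to the old key."""
    for key in (account.current_key, account.old_key):
        plain = unseal(key, blob)
        if plain is not None:
            return plain
    raise ValueError("neither userDirKey value decrypts this file")


def rotate_key(account: Account, files: dict[str, bytes], new_key: bytes) -> None:
    account.old_key = account.current_key    # step 1: old := current
    account.current_key = new_key            # step 2: install the new current key
    for name in list(files):                 # step 3: re-encrypt file by file
        files[name] = seal(new_key, read_directory(account, files[name]))
```

Because the account entry is updated before any file is
re-encrypted, a reader that arrives in the middle of step 3
finds files under either key and can still open all of them.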

The user may change his password at any time by
posting a password change request in his Password
Log file (e.g., 140). This request is encrypted with the
current userDirKey value 541 and contains the new
password. When the Agent 108 gets around to processing
this request, it re-encrypts the user's account entry
in the User Account Database file 146 according to the
new password and acknowledges the request by updating
his Password Log file (e.g., 140). In the interim, the
user may use the new password, because a list of
recent passwords is maintained in the Password Log
file, encrypted using the latest password. When the user
needs access to the userDirKey 541 (e.g., to perform a
restore), the software uses the latest password to
access userDirKey 541 in the User Account Database
146; upon failure, the password "history" is accessed
and old passwords are tried automatically until one
works. Thus, the user can change his password several
times and continue to work without needing to wait for
the Agent 108 to process his change request. Note
that CRCs are embedded in these files in all cases to
verify that the password is correct. In the worst case of
a user forgetting his password or inadvertently deleting
his Password Log file while a request is pending, the
administrator can easily issue the user a new password.

As an administrator-configurable option in the preferred
embodiment, in order to help ensure a certain level of
security, the backup software may prompt the user to
change his password on a periodic basis and check that
all passwords have a minimum length (and are not
re-used). In an alternate embodiment, as an ultimate back
door, it would be possible to have the administrator
software keep a log of all user passwords and userDirKey
values in a file that is encrypted using a public key
algorithm, which only a certified third party has the
ability to decrypt. In this case, if the administrator
loses the ability to restore passwords, the third party
could recover the administrator and user passwords,
probably for a considerable fee to cover the cost of
checking the legitimacy of the request and to discourage
frivolous use of this service.

One goal of the preferred embodiment is to allow
the user to perform backups without entering a password.
This ability is particularly important in the common
case of performing scheduled backups when the user is
not present. At the same time, it is clearly desirable
to require a password in order to restore data.
Fortunately, this feature is easily implemented as follows.
During each backup, the backup software posts the
backup directory file (e.g., 143), encrypted using a
special user-specific key (userPostKey) just for this
purpose. The userPostKey value is included in the user
account entry (which is encrypted using the user
password 540) of the User Account Database file 146; this
key may also be stored on the local workstation disk so
that it is available without entering a password. As part
of the migration of the backup directory file to the
\BACKUP\SYSTEM path 122, the Agent 108, which has
access to both keys, subsequently re-encrypts the file
using userDirKey 541. In the preferred embodiment,
there is thus a brief period of time, from when the
backup directory file is first posted until the Agent 108
migrates it, when the system is dependent on network
security and on the security of the local workstation to
maintain the privacy of the directory file, since a hacker
could in theory copy the userPostKey from the local
workstation and the backup directory file (e.g., 143). It
would be possible to overcome this limitation in an
alternate embodiment by posting the directory file with a
public-key encryption algorithm, using the Agent's public
key; such an approach seems overkill, however,
particularly in light of the fact that once a hacker has
access to the user's workstation (to get the unauthorized
copy of userPostKey), the privacy of the backup data set is
probably the least of anyone's concerns.

In addition, the backup software maintains the Previous
Dir file (e.g., 141), which is also encrypted with
userPostKey, and can thus be accessed without a password.
This file contains a copy of all the directory information
for the most recent backup, allowing identification of
unchanged and modified files at the next backup. The
software of the preferred embodiment may also retain a
cached copy of this file on the local workstation to
minimize network bandwidth. Note that, since this file
does not contain the encryption fingerprints that are used
for encrypting the file data, only a knowledge of directory
information (as opposed to the file data encryption keys)
would be compromised in the worst case if the contents of
the Previous Dir file were somehow compromised. In the
rare case where this file is corrupted or deleted, which
can be detected by checking CRCs, the backup software of
the preferred embodiment rebuilds the Previous Dir file
from the previous (encrypted) backup directory file(s),
although such rebuilding does require the user to enter
his password.

3. Restore Process



The preferred embodiment provides two principal
ways of selecting the backup set to be restored. In the
conventional approach, the user is presented with a list
of previous backups, along with the backup time, date, and
description (e.g., from a user's Backup Log file such as
150), from which he selects the desired backup set. In the
alternate approach, the user selects a file from the
current disk contents and is presented with a list of all
previous versions of that file contained in all the backup
sets. This list is typically presented as a selectable set
of icons on a calendar showing when new versions were
backed up. In order to speed up the initial generation of
this list once the user has chosen the file, in an
alternate embodiment, a <lastVersion> field is added to
each <fileEntry> record 207 to provide a direct linked
list of all unique versions of each file, as mentioned
previously.

In the preferred embodiment, there are two methods
of restoring data from the backup storage means 101
once the backup set is selected. The first technique
is basically identical to a "conventional" restore
operation. The user is presented with a tree of files
available for restore, taken from the associated backup
directory file. After the user "tags" the desired files
and specifies the restore destination, the restore
software retrieves the file contents from the backup data
file(s) and writes them to the destination.

The second restore paradigm provides much more
flexibility in accessing the data. Once the user selects
the backup set, the file set is "mounted" as a read-only
disk volume by a special file system driver. This driver is
implemented as an installable file system (IFS) in the
preferred embodiment; in an alternate embodiment, the
disk volume is mounted using a block device driver in
which the on-disk format of a normal disk volume is
synthesized to match the contents of the backup set.
Regardless of its underlying structure, the driver
provides all the operating system specific functions
necessary to allow any application to access the files. For
example, if the user wishes to view a spreadsheet file
that was backed up in the associated backup set, once
the backup set is mounted he may simply run his
spreadsheet program and open the file directly on the
mounted volume, without having first to copy the file to a
local hard disk; alternately, the user may simply copy
any files from the mounted volume to his local hard disk
using his own favorite file management application. This
approach allows the user to access his backup data in a
more intuitive way, using his own tools and applications,
instead of via a dedicated restore application that is
unfamiliar because it is rarely used. It also works around
the common problem of inadvertently overwriting the
current version of a file when restoring an older version
from a backup set using a conventional restore program.

Observe that because the backup storage means
101 is a random access device, the time required to
access any file is comparable to normal disk access
times, although it may require a few more seeks to follow
<externPtr> references 420. The associated backup
directory file is loaded from disk once when the
backup set is chosen, after which the access to any
particular file anywhere in the backup directory tree
involves only reading in the associated <fileInfo> record
408 and accessing the data blocks. Thus, a restore
operation in the preferred embodiment is considerably
faster in almost all cases than a comparable restore
operation from a tape backup system. In particular, file
access is fast enough that accessing files on the
mounted backup volume is usually imperceptibly slower
than accessing the files on the original disk drive. An
alternate embodiment can take further advantage of
this "real-time" nature of the mounted backup volume by
adding driver software logic allowing it to be writable, in
which all writes actually are stored in a local transient
cache that may overflow onto the local disk. Any writes
to this transient cache will be discarded once the
volume is unmounted. Such an approach allows the user,
for example, to mount a volume and perform a transient
"update in place" operation, such as a compilation or a
database sort, retrieve the relevant results from the
operation, and then unmount the volume; effectively, the
user has temporarily taken his disk drive back in time to
perform the update operation.
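
The writable variant of the mounted backup volume described
above might be modeled in Python as follows. The class name
MountedBackupVolume, the use of an in-memory dictionary for
the transient cache, and the path-to-bytes model of a volume
are assumptions made for this sketch; an actual embodiment
would be a file system driver, and its cache could overflow
onto the local disk.

```python
class MountedBackupVolume:
    """Read-only backup image with a transient write overlay (illustrative only)."""

    def __init__(self, backup_image: dict[str, bytes]) -> None:
        self._image = backup_image            # contents of the backup set; never modified
        self._overlay: dict[str, bytes] = {}  # transient cache holding all writes
        self._mounted = True

    def _require_mounted(self) -> None:
        if not self._mounted:
            raise RuntimeError("volume is not mounted")

    def read(self, path: str) -> bytes:
        """Return the written version if one exists, else the backed-up version."""
        self._require_mounted()
        if path in self._overlay:
            return self._overlay[path]
        return self._image[path]

    def write(self, path: str, data: bytes) -> None:
        """Record the write in the transient cache only."""
        self._require_mounted()
        self._overlay[path] = data

    def unmount(self) -> None:
        """Discard every transient write and detach the volume."""
        self._overlay.clear()
        self._mounted = False


# Example: sort a database "in place" on the mounted volume, then unmount.
image = {"ledger.db": b"unsorted"}
volume = MountedBackupVolume(image)
volume.write("ledger.db", b"sorted")
result = volume.read("ledger.db")
volume.unmount()
```

After the unmount the backup image is exactly as it was, so a
later mount of the same backup set again shows the original
contents.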

The restore method of the preferred embodiment is
also somewhat unique in that, although each backup
operation after the initial backup is effectively an
"incremental" backup, the image presented for restore
contains all files present on the source disk at the time
of the backup, and all of these files are accessible in
real-time, as discussed above. The random access nature of
the backup storage means 101 allows only file changes
to be stored, thus providing great savings in storage
cost while still allowing for real-time access to all files.
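
The combination of incremental storage with a complete
restore image might be illustrated by the following Python
sketch. It is a simplification under stated assumptions: a
file is treated as unchanged when its modification time
matches the previous directory, contents are identified by a
SHA-256 digest rather than by the fingerprints of the
preferred embodiment, whole files are stored rather than
differences, and the names run_backup and restore_image are
hypothetical.

```python
import hashlib

# A directory maps path -> (modification time, content identifier).
Directory = dict[str, tuple[int, str]]


def run_backup(
    disk: dict[str, tuple[int, bytes]],
    previous_dir: Directory,
    store: dict[str, bytes],
) -> Directory:
    """Build a complete directory for this backup while storing only new contents."""
    directory: Directory = {}
    for path, (mtime, data) in disk.items():
        prev = previous_dir.get(path)
        if prev is not None and prev[0] == mtime:
            directory[path] = prev               # unchanged: reuse the earlier reference
            continue
        content_id = hashlib.sha256(data).hexdigest()
        store.setdefault(content_id, data)       # new or changed: store contents once
        directory[path] = (mtime, content_id)
    return directory


def restore_image(directory: Directory, store: dict[str, bytes]) -> dict[str, bytes]:
    """Return every file present at backup time, wherever its data was first stored."""
    return {path: store[content_id] for path, (_, content_id) in directory.items()}


store: dict[str, bytes] = {}
disk1 = {"a.txt": (1, b"alpha"), "b.txt": (1, b"beta")}
dir1 = run_backup(disk1, {}, store)

disk2 = {"a.txt": (1, b"alpha"), "b.txt": (2, b"beta v2"), "c.txt": (2, b"alpha")}
dir2 = run_backup(disk2, dir1, store)
```

In this example the second backup adds only one new item to
the store (the changed b.txt; the new c.txt duplicates existing
contents), yet its directory still describes all three files.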

Another major benefit accrues to the present invention.
Note that, once the system is installed and configured,
no administrator interaction is required to perform
any backup or restore operation, other than to make
sure that there is sufficient free disk space on the
backup storage means 101. Assuming that the cost of
providing enough disk space is acceptably low, which
seems to be the case in practice due to the high levels
of data reduction achieved in the preferred embodiment,
the backup system has a very low maintenance cost. By
comparison, most tape backup systems require operator
intervention to change tapes periodically and to
mount a tape for a given restore request; even with
(expensive) tape or optical disk jukebox hardware, such
operations seem almost primitive in contrast with the
real-time nature of the present invention.

Note that, in an alternate embodiment, the present
invention can also be applied to a single computer. In
this case, the backup storage means 101 might be a
section of the local hard disk, or a removable disk
device (e.g., Bernoulli, Syquest), or a portion of a
network disk volume. The benefits of duplicate file
identification probably are not significant in this
instance, but all the other considerable benefits
discussed still apply. The Agent process 108 could be run
as a background process on the single computer by the
backup administrator, or the Agent functions could be
configured to run as part of each backup operation.

4. Agent Functions

The Agent process 108 runs on a node 107 on the
network [...] desktop PC, but it may also run as a software
task on the file server 100. The backup administrator
configures the Agent process 108, both in its location and
performance characteristics, which are quite scalable, as
described below. These settings may be varied over time as
use of the backup system evolves. For example, in a backup
system with only a few users, the Agent process 108
may run as a background task on the administrator's
own desktop PC. As more users are added and the
Agent process 108 requires more time, the administrator
may opt to dedicate a PC on the network to run the
Agent process 108. Eventually, it may make sense to
install a backup file server dedicated solely to backup,
including running the Agent process 108, which then
can access the backup storage means 101 as a local
disk volume instead of over the network. It is fairly
simple in the preferred embodiment to change how often
and where the Agent process 108 runs in order to meet
the needs of the backup clients.

In addition to the migration and other Agent func- 
tions discussed above, there are some other concerns 
that must be addressed by the Agent software. For
example, there is a potential problem with backup "cli- 
ents" that crash during the middle of a backup opera- 
tion. Similarly, the Agent 108 itself could crash during a 
migration or consolidation (discussed below) operation. 
Both the application and Agent software are robust
enough in the preferred embodiment to detect such 
conditions and respond properly, including the ability to 
"clean up their own mess" the next time they are run. A 
few other such issues are discussed briefly below. 

With a little thought, it becomes clear that there is a
small problem in the preferred embodiment which, if
ignored, might cause a backup operation to fail to identify
some duplicate files and thus slightly affect storage
requirements. If two users are performing backups
concurrently (or actually if one starts before the other's
backup files have been migrated by the Agent 108 from
the \BACKUP\USER path 121 to the \BACKUP\SYSTEM
path 122), neither user will be able to identify duplicate
files from the other. This is probably of most
concern during the "initialization" period that occurs
when the first few users are running their initial backup,
though the problem never goes away entirely. The
workaround for this problem in the preferred embodiment is
to have the Agent 108 perform some additional duplicate
file elimination as part of the migration process.
This can be done without modifying the contents of the
backup [...]
a duplicate file is changed to contain a single [...]
reference 420 encompassing the entire file. For
performance reasons, this activity might actually be
deferred until a later time, such as the middle of the
night, when the network should have less traffic. It is
possible in practice that this problem simply isn't
significant enough to worry about, particularly if the
administrator "primes the pump" after installation by having a
few representative nodes perform their initial backups
sequentially to build up the initial global database. Thus,
in the preferred embodiment, the administrator can
disable this functionality.

In some cases, a user may wish to delete certain
backup sets, typically to save space on the backup storage
means 101. For example, the user may decide to
merge old daily backups into weekly (or monthly) backups
after a few months have passed. Because of the
duplicate file identification and file differencing of the
preferred embodiment, the resulting disk savings are
usually very [...]
backup application posts a file requesting the Agent 108
to perform the deletion, which may involve consolidating
several backup sets into a single backup directory/data
file set in order to retain copies of any file and directory
entries that are referenced by the remaining backup
sets, either of this user or other users. This consolidation
operation may be best deferred until a non-busy
time on the network. Completion of the consolidation
operation may also have to be deferred until no users
have a backup set mounted that contains a reference to
the file(s) in question. Observe that the use of indices
(instead of direct pointers) both for file and directory
references greatly simplifies such an operation; such
consolidation could still be performed without this extra level
of indirection, but it would in general involve time-consuming
changes to many of the remaining backup files,
instead of the creation of the single "stub" backup file
set that results in the preferred embodiment.

5. Administrator functions 

The backup administrator, who may or may not be a 
network administrator, has several functions to perform 
in the preferred embodiment. It is intended that these
functions be largely transparent, with most effort being 
expended at installation time, but from time to time other 
decisions and actions may become necessary. 

5.1 Installation 

The backup administrator should install the backup 
software on the Agent computer 107 (which may be his 
own desktop) and should set up the network directory 
structure (e.g., \BACKUP\SYSTEM 122 and
\BACKUP\USER 121) where the backup files are to be
stored. Setting up the initial directory structure and
access rights may involve some help from a network
administrator, depending on the network access rights
of the backup administrator.

The backup software is distributed on either CD-ROM
or floppy disk, but in general only the backup
administrator will ever have to use the distribution
media, since the software of the preferred embodiment
installs itself on the network in such a way to allow users
to run a SETUP program [...]
an extent as possible, automated: the
administrator only has to inform the installation software
where the \BACKUP directory 120 is located, and the
software installs itself.

5.2 Adding new users 

With the backup software installed, before a user
can actually perform any backups, the administrator
must set up an "account" for the user in the User
Account Database 146. This is important for two reasons.
First, each user must have his own directories
(e.g., 125, 129) and a unique <userindex> number which is
crucial in identifying files that are shared across users.
Second, keeping an account database allows the
administrator to limit access to the system and to meter
use of the software according to the terms of his license.

As part of adding a new user account in the preferred
embodiment, the administrator software creates
the user directories with appropriate access rights
(again, this [...]).
Each user is also assigned a unique user name chosen
by the administrator, such as JOHN, and a unique 16-bit
<userindex> (which the user never needs to know
directly). This information, together with the unique user
directory name (e.g., USER2), which is based on the
<userindex> instead of the user name in the preferred
embodiment, is written to the User Account Database
146, which is read-only for all users. Note that, although
the presence of this User Account Database does allow
a hacker to determine the <userindex> and directory name
of any user (with some reverse engineering, since those
two fields, while not encrypted, are not stored in the
clear in the preferred embodiment), such knowledge
does not compromise the privacy of the user's data in
any way other than perhaps a knowledge of the frequency
and size of backup sessions, assuming the
user's password is not compromised. The administrator
also assigns the user an initial password which is used
to encrypt private fields of the user's account, such as
the userDirKey and userPostKey values. In the preferred
embodiment, the user account entry is duplicated
in the User Account Database 146, encrypted with the
administrator's password, so that the keys will not be
lost if the user forgets his password.

The administrator next informs the user, usually via
e-mail, that his account is now active, giving him the
assigned user name and (temporary) password. The
user then runs the SETUP program from the
\BACKUP\SYSTEM\GLOBAL directory, which under
Microsoft Windows 3.1 may be effected by attaching the
.EXE file to the e-mail message so that the user can
simply double-click the icon. The user enters his
account name and password, and the software sets up
a personal backup directory (typically on the user's local
hard disk) [...]
directory. This personal directory is also used for caching
certain files, such as the Previous Dir file (e.g., 141),
in order to minimize network bandwidth consumption.
Note that it is possible (and probably desirable), if the
user so chooses, to copy only a minimum set of program
files locally, so that the user always runs the latest
copy of the software from the network. Alternately, the
software checks its version against that on the network
to make sure that it is the latest and asks the user for
permission to upgrade when a new version is detected.

The user may also be asked to change his personal
password during the initial installation. During the
SETUP procedure, the user will be queried to enter any
relevant personal preferences, such as how often to
schedule periodic backups and where the personal
backup directory should be located. These preferences,
along with the user name and <userindex>, are stored in
the User Preferences file (e.g., 142). Most preferences
may be changed later.

5.3 Overseeing the Agent 

[...] without any supervision. However, circumstances may
arise (such as a system crash) that could require some
intervention by the administrator to restart the Agent
process 108. It is intended that the Agent 108 in general
be able to recover from most problems that arise, but it
is probably not possible to guarantee complete recoverability.
The Agent process 108 of the preferred embodiment
generates a log file of its activities that the
administrator can review. The administrator also has a
monitoring application that can perform some simple
checks to make sure that the Agent 108 is performing its
tasks on a timely basis, giving warnings upon observing
any activity (or lack thereof) that appears suspect.

The invention has been described in an exemplary
and preferred embodiment, but is not limited thereto.
Those skilled in the art will recognize that a number of
additional modifications and improvements can be
made to the invention without departure from the essential
spirit and scope. The scope of the invention should
only be limited by the appended set of claims.

Claims 

1. A method for backing up data files stored on a disk
volume of a node of a computer network to a
backup storage means, said backup storage means
containing data files already backed up from other
nodes on said computer network, said method
comprising the steps of:

searching through a list of said files already
contained in said backup storage means for a
match to files to be backed up [...];
operative when no match is found between a
file to be backed up [...] and
any of said files already contained in said list,
storing on said backup storage means a complete
representation of the contents of said file
to be backed up; computing an index that indicates
the location on said backup storage
means of said complete [...];
adding to [...]
to be backed up [...];
operative when a match is found between a file
to be backed up from said disk volume and a
file [...]
index that indicates the location on said backup
storage means of a complete representation of
the contents of said file already contained in
said list;
storing a data structure specifying the directory
structure of said disk volume at the time of the
backup operation, said data structure also
including, for each said file backed up from said
disk volume, said index indicating the location
of said complete representation, either of said
file to be backed up or of said file already contained
in said list depending on the outcome of
said search through said list; and
whereby a file that is duplicated across nodes
may be identified so that only one copy of the
contents of said file is stored on said backup
storage means.
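
The steps of claim 1 can be illustrated with a minimal sketch. It is not the implementation of the preferred embodiment: the class name BackupStore, the use of SHA-256 over the whole file as the matching criterion, and the in-memory lists standing in for the backup storage means are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BackupStore:
    """Hypothetical stand-in for the shared backup storage means."""
    blobs: list = field(default_factory=list)      # stored file contents
    index_of: dict = field(default_factory=dict)   # fingerprint -> index into blobs

    def backup_volume(self, volume: dict) -> dict:
        """Back up {path: bytes}; return {path: index}, the directory structure."""
        directory = {}
        for path, data in volume.items():
            fp = hashlib.sha256(data).hexdigest()
            idx = self.index_of.get(fp)
            if idx is None:
                # No match in the list: store the contents and add an entry.
                idx = len(self.blobs)
                self.blobs.append(data)
                self.index_of[fp] = idx
            # Match or not, the directory structure records only the index.
            directory[path] = idx
        return directory

    def restore(self, directory: dict) -> dict:
        """Rebuild {path: bytes} from a saved directory structure."""
        return {path: self.blobs[idx] for path, idx in directory.items()}
```

In this sketch, two nodes that both hold the same file end up with directory structures pointing at the same index, so the contents are stored once.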


2. The method of claim 1 in which the step of storing
said complete representation of the contents of said
file to be backed up further includes the step of:

operative when a previous version of said file
has already been backed up from said node to
said backup storage means, computing the differences
from the previous version of said file,
representing portions of the contents of said
file to be backed up using indices into the representation
of said previous version on said
backup storage means.

3. The method of claim 2 in which the existence of
said previous version of said file is detected using a
previously saved data structure specifying the
directory structure of a previous backup operation,
and in which said differences between said versions
are computed using an index, contained in
said previously saved data structure, to a complete
representation of the contents of said previous version
of said file.
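
A minimal sketch of such differencing follows, in the spirit of the chunk-hash comparison recited later in claims 18 and 19. The chunk size, the function names, and the ('ref', offset) / ('lit', bytes) encoding are assumptions for illustration. For simplicity the window hash is recomputed at each byte position rather than updated incrementally, and a match is accepted on the hash alone, which is what makes the approach probabilistic.

```python
import zlib

CHUNK = 4  # tiny chunk size for illustration; an assumption, not from the source


def chunk_hashes(prev: bytes) -> dict:
    """Hash each fixed-size chunk of the previous version: hash -> chunk offset."""
    table = {}
    for off in range(0, len(prev) - CHUNK + 1, CHUNK):
        table.setdefault(zlib.crc32(prev[off:off + CHUNK]), off)
    return table


def diff(new: bytes, table: dict) -> list:
    """Encode `new` as ('ref', offset) items and ('lit', bytes) items."""
    out, lit, i = [], bytearray(), 0
    while i < len(new):
        window = new[i:i + CHUNK]
        off = table.get(zlib.crc32(window)) if len(window) == CHUNK else None
        if off is None:
            lit.append(new[i])        # no match here: slide the window one byte
            i += 1
        else:
            if lit:
                out.append(("lit", bytes(lit)))
                lit = bytearray()
            out.append(("ref", off))  # whole chunk found in the previous version
            i += CHUNK
    if lit:
        out.append(("lit", bytes(lit)))
    return out


def patch(prev: bytes, ops: list) -> bytes:
    """Rebuild the new version from the previous version and the diff."""
    return b"".join(prev[v:v + CHUNK] if kind == "ref" else v for kind, v in ops)
```

Because the window advances one byte at a time when nothing matches, a chunk of the previous version is still found after an insertion shifts it off a chunk boundary.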

4. The method of claim 3 in which the step of storing
said complete representation of the contents of said
file to be backed up further includes the step of
compressing portions of said representation using
a lossless data compression algorithm before storing
said representation on said backup storage
means.

5. The method of claim 4 in which the step of storing
said complete representation of the contents of said
file to be backed up further includes the step of
encrypting portions of said complete representation
using a hash encryption key that is derived by computing
a hash function on the contents of said file to
be backed up.

6. The method of claim 5 in which said hash encryption
key is stored as part of said data structure
specifying the directory structure of said disk volume.

7. The method of claim 1 in which the step of storing
said complete representation of the contents of said
file to be backed up further includes the step of
encrypting portions of said complete representation
using a hash encryption key that is derived by
computing a hash function on the contents of said
file to be backed up.

8. The method of claim 7 in which said hash encryp- 
tion key is stored as part of said data structure 
specifying the directory structure of said disk vol- 
ume. 
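
The idea of a hash encryption key in claims 5 to 8 can be sketched as follows. The source does not name a hash function or a cipher in this section; SHA-256 and the XOR keystream below are stand-ins chosen only so that the example runs with the Python standard library, and the function names are hypothetical. The toy cipher is for illustration and is not suitable for protecting real data.

```python
import hashlib


def content_key(data: bytes) -> bytes:
    """Key derived from a hash function computed on the file contents."""
    return hashlib.sha256(b"key:" + data).digest()


def _keystream(key: bytes, n: int) -> bytes:
    """Expand the key into n pseudo-random bytes (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with the keystream; encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
```

In this sketch, any node that holds the same file contents derives the same key, while a node without the contents has nothing from which to derive it.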

9. The method of any of claims 1-8 in which the step
of storing said data structure specifying said structure
of said disk volume further includes the steps
of:

compressing portions of said data structure
with a lossless data compression algorithm;
and

encrypting said hash encryption key using an
encryption key that is private to said node on
said computer network.

10. The method of any of claims 1-8 in which the said 
list of said files already contained in said backup 
storage means is organized as a database in order 
to minimize search time. 

11. The method of claim 10 in which each entry in said 
database includes a hash function computed on the
directory entry information for the file associated 
with said entry including the file name, length, and 
time of creation, and a hash function computed 
over portions of the contents of said file. 

12. The method of claim 11 in which said search of said
database includes the following steps:

loading a first section of said database, said
first section containing partial entries, each
partial entry containing only a portion of an
entry of said database;
generating a new database entry of said file to
be backed up; and
searching through said first section for a match
between said new database entry and said partial
entries and, operative when a match is
found between said new database entry and a
partial entry of said first section, loading the
remaining portions of said matching partial
entry from the associated entry in a second
section of said database, and comparing said
new database entry with said remaining portion
of said [...]
there is a complete match between said new
database entry and the complete database
entry.

13. The method of claim 12 in which said first section of
said database is stored in a sorted order based on
bit fields of said partial entries and is compressed
with a lossless data compression algorithm.
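
The two-section search recited in claim 12 might look like the sketch below. The entry layout (an 8-byte hash of directory information followed by an 8-byte hash of contents), the 2-byte partial entries, and the names make_entry and GlobalDatabase are assumptions for illustration. The counter full_loads stands in for the cost of reading the second section, which the first section is meant to avoid in most cases.

```python
import hashlib
from typing import Optional

PARTIAL = 2  # bytes of each entry kept in the first section (an assumption)


def make_entry(name: str, size: int, created: int, data: bytes) -> bytes:
    """Hypothetical database entry: hash of directory info + hash of contents."""
    meta = hashlib.sha256(f"{name}|{size}|{created}".encode()).digest()[:8]
    body = hashlib.sha256(data).digest()[:8]
    return meta + body


class GlobalDatabase:
    def __init__(self, entries):
        self.second = list(entries)                      # full entries ("on disk")
        self.first = [e[:PARTIAL] for e in self.second]  # partial entries ("in memory")
        self.full_loads = 0                              # counts second-section reads

    def find(self, new_entry: bytes) -> Optional[int]:
        """Return the index of a complete match, or None."""
        prefix = new_entry[:PARTIAL]
        for i, partial in enumerate(self.first):
            if partial != prefix:
                continue              # rejected using the first section only
            self.full_loads += 1      # partial match: load the remainder
            if self.second[i][PARTIAL:] == new_entry[PARTIAL:]:
                return i              # complete match
        return None
```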

14. The method of claim 1[...] data
compression algorithm includes storing an array
indicating how many consecutive entries in said
sorted first section have bit fields of each possible
value of said bit fields, and in which the rest of said
first section omits said bit fields from the remainder
of said partial entries.
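
The count-array compression of claim 14 can be sketched as follows. The 16-bit partial entries and the 8-bit leading bit field are assumptions for illustration: once the entries are sorted on the leading field, an array of run lengths replaces that field, and only the remaining bits of each entry are stored.

```python
FIELD_BITS = 8   # width of the leading bit field (an assumption)
REST_BITS = 8    # remaining bits kept per partial entry (an assumption)


def compress(partials):
    """partials: iterable of 16-bit ints. Returns (counts, rests)."""
    counts = [0] * (1 << FIELD_BITS)   # entries per possible bit-field value
    rests = []                         # partial entries with the bit field omitted
    for p in sorted(partials):
        counts[p >> REST_BITS] += 1
        rests.append(p & ((1 << REST_BITS) - 1))
    return counts, rests


def decompress(counts, rests):
    """Rebuild the sorted partial entries from the count array and remainders."""
    out, pos = [], 0
    for value, n in enumerate(counts):
        for rest in rests[pos:pos + n]:
            out.append((value << REST_BITS) | rest)
        pos += n
    return out
```

The count array also tells a search where the run for a given bit-field value starts, so only that run needs to be scanned.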

15. The method of any of claims 1-8 in which the con- 
tents of a particular backup operation are mounted 
as a restored disk volume having a directory structure
identical to that of the original disk volume at
the time of said backup operation, whereby said 
files on said restored disk volume may be accessed 
from any application software that uses normal file 
system input/output calls. 

16. The method of claim 15 in which said restored disk
volume is accessible on a read-only basis.

17. The method of claim 15 in which said restored disk
volume is accessible for reads and writes, all writes
to said restored disk volume being cached in a transient
storage means, the contents of which are discarded
when said restored disk volume is
unmounted.

18. The method of any of claims 2-8 in which the differences
between said file to be backed up and said
previous version of said file are computed using a
probabilistic algorithm, including the following
steps:

at the time when said previous version was
backed up, storing on said backup storage
means a set of hash function values computed
on fixed size chunks of said previous version;
at the time of said backup, loading said previously
stored hash function results;
comparing said hash function results from said
previous file version to hash function results
[...]
backed up; and
operative when [...]
backed up has th[...]
of said previous file, representing said chunk of
[...]

19. The method of claim 18 in which said comparison
of said hash function results includes sliding the
[...]
in said file to be backed up may be found on any
byte boundary in said file to be backed up, and not
solely on chunk boundaries.

20. The method of claim 19 in which said hash function
[...]tion of a cyclic redundancy check
(CRC).

21. The method of claim 18 in which the contents of a
particular backup operation are mounted as a
restored disk volume having a directory structure
identical to that of the original disk volume at the
time of said backup operation, whereby said files on
said restored disk volume may be accessed from
any application software that uses normal file system
input/output calls.

22. The method of claim 21 in which said restored disk
volume is accessible on a read-only basis.

23. The method of claim 22 in which said restored disk
volume is accessible for reads and writes, all writes
to said restored disk volume being cached in a transient
storage means, the contents of which are discarded
when said restored disk volume is
unmounted.